leerho commented on issue #6865: Densify swapped hll buffer
URL: https://github.com/apache/incubator-druid/pull/6865#issuecomment-464409814
 
 
   @clintropolis 
   
   Sorry about the delay, I was pulled off onto other problems :)
   
   This code has so many serious issues that I hardly know where to start.  If 
we are going to submit fixes, then I would suggest that we fix an issue 
throughout the code and not just in one place.
   
   - Checking for sparse vs dense
   Specifically, your added check if sparse in the fold() routine may be OK on 
the surface, but one of the serious problems in the code is that the author 
uses a very fragile method for determining whether the sketch is sparse or 
dense and it is done differently in different parts of the code.  (It should 
have been a simple boolean in the header!)
   
   In your check you compare (and in other parts the author does this as well 
...)
   `if (other.storageBuffer.remaining() != other.getNumBytesForDenseStorage())`
   
   The `isSparce(buffer)` method compares:  `buffer.remaining() != 
NUM_BYTES_FOR_BUCKETS;`
   
   They are different!  NUM_BYTES_FOR_BUCKETS = 1024, while 
getNumBytesForDenseStorage() = 1031.
   
   The author should have used `isSparse(buffer)` everywhere.
   
   It is fragile because the remaining size of the buffer relies on the current 
state of the buffer position and limit.  Yikes!    
   
   And `isSparse(buffer)` sometimes fails as follows:
   
   One of the tests you added demonstrates this. In 
`testRegisterSwapWithSparse()` after you artificially fill all the registers 
with 1's, when you check the cardinality it returns zero and you say it is 
"fine".  
   It is not "fine".  It is a symptom of more serious issues.
   
   The reason it returns zero is because the toByteBuffer() returns a sparse 
representation when it should be dense, and  estimateCardinality() of sparse 
uses the linear (or Poisson) estimator which is `k log (k/v)` where v = k, and 
the estimate is zero!   The sketch should not have been sparse and even if it 
was sparse, the linear estimator should never be used when v approaches k!  
   
   These issues could be fixed, but it will require a version change and a lot 
of careful coding.
   
   I have more, but I have to go now.
   
   
   
   
   
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to