leerho commented on issue #6865: Densify swapped hll buffer URL: https://github.com/apache/incubator-druid/pull/6865#issuecomment-464409814 @clintropolis Sorry about the delay, I was pulled off onto other problems :) This code has so many serious issues that I hardly know where to start. If we are going to submit fixes, then I would suggest that we fix an issue throughout the code and not just in one place. - Checking for sparse vs dense Specifically, your added check if sparse in the fold() routine may be OK on the surface, but one of the serious problems in the code is that the author uses a very fragile method for determining whether the sketch is sparse or dense and it is done differently in different parts of the code. (It should have been a simple boolean in the header!) In your check you compare (and in other parts the author does this as well ...) `if (other.storageBuffer.remaining() != other.getNumBytesForDenseStorage())` The `isSparce(buffer)` method compares: `buffer.remaining() != NUM_BYTES_FOR_BUCKETS;` They are different! NUM_BYTES_FOR_BUCKETS = 1024, while getNumBytesForDenseStorage() = 1031. The author should have used `isSparse(buffer)` everywhere. It is fragile because the remaining size of the buffer relies on the current state of the buffer position and limit. Yikes! And `isSparse(buffer)` sometimes fails as follows: One of the tests you added demonstrates this. In `testRegisterSwapWithSparse()` after you artificially fill all the registers with 1's, when you check the cardinality it returns zero and you say it is "fine". It is not "fine". It is a symptom of more serious issues. The reason it returns zero is because the toByteBuffer() returns a sparse representation when it should be dense, and estimateCardinality() of sparse uses the linear (or Poisson) estimator which is `k log (k/v)` where v = k, and the estimate is zero! The sketch should not have been sparse and even if it was sparse, the linear estimator should never be used when v approaches k! These issues could be fixed, but it will require a version change and a lot of careful coding. I have more, but I have to go now.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
