adarshsanjeev opened a new issue, #478:
URL: https://github.com/apache/datasketches-java/issues/478

   Recently, we found a performance issue while using data sketches through 
Apache Druid.
   
   There was some slowness while running aggregations which merge HllSketches 
to get a final count. Looking through some flame graphs, a lot of the time 
seems to be spent checking if the sketches are empty.
   
   In that particular data, it seems that there were a large number of empty 
sketches. This resulted in each sketch being deserialized from a byte array 
before calling an isEmpty() check. This is something that could be avoided 
since merging an empty sketch is a no-op, and a way to check if this is the 
case without first deserializing the sketch entirely might help here.
   
   We added [a custom 
check](https://github.com/apache/druid/pull/15162/files#diff-4211c731d7f7118cd3449bba2253870b97a1a83ed2feac8521a3d1246aa4c53cR87)
 that inspects the byte array without deserializing it, to determine whether 
the sketch is empty (by checking the isEmpty flag in the header for the sketch 
implementation, and the number-of-elements byte for the set and list 
implementations). With only this change, merging 10M empty sketches went from 
11 seconds to 9.7 seconds.
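
   For illustration, a header-only check along these lines might look like the 
sketch below. This is not the code from the linked PR. The byte offsets, the 
flag mask, the mode values, and the assumption that the set count is stored as 
a little-endian int are all assumptions about the serialized HllSketch layout 
and should be verified against the library's preamble definitions. The class 
and method names (`EmptySketchCheck`, `looksEmpty`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: decides from the header alone whether a serialized
// HllSketch is empty. All layout constants below are ASSUMPTIONS and must be
// checked against the library's actual preamble layout before use.
final class EmptySketchCheck {
    static final int FLAGS_BYTE = 5;          // assumed offset of the flags byte
    static final int LIST_COUNT_BYTE = 6;     // assumed offset of the list-mode count
    static final int MODE_BYTE = 7;           // assumed offset of the mode byte
    static final int HASH_SET_COUNT_INT = 8;  // assumed offset of the set-mode count
    static final int EMPTY_FLAG_MASK = 4;     // assumed bit for the isEmpty flag
    static final int CUR_MODE_MASK = 3;       // assumed: low two bits hold the mode
    static final int MODE_LIST = 0;
    static final int MODE_SET = 1;
    static final int MODE_HLL = 2;

    private EmptySketchCheck() {}

    // Returns true only when the header indicates an empty sketch.
    // Returns false whenever the input is too short or the mode is unknown,
    // so the caller falls back to the normal deserialize-and-merge path.
    static boolean looksEmpty(byte[] bytes) {
        if (bytes == null || bytes.length < 8) {
            return false;
        }
        int mode = bytes[MODE_BYTE] & CUR_MODE_MASK;
        switch (mode) {
            case MODE_LIST:
                return (bytes[LIST_COUNT_BYTE] & 0xFF) == 0;
            case MODE_SET:
                if (bytes.length < HASH_SET_COUNT_INT + Integer.BYTES) {
                    return false;
                }
                return ByteBuffer.wrap(bytes)
                        .order(ByteOrder.LITTLE_ENDIAN)
                        .getInt(HASH_SET_COUNT_INT) == 0;
            case MODE_HLL:
                return (bytes[FLAGS_BYTE] & EMPTY_FLAG_MASK) != 0;
            default:
                return false;
        }
    }
}
```

   A caller would skip the merge when `looksEmpty(bytes)` returns true and 
deserialize as usual otherwise. Because the check errs on the side of 
returning false, a wrong guess costs only the deserialization that would have 
happened anyway.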
   
   Is there scope for adding a check like this to the DataSketches library 
itself? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

