adarshsanjeev opened a new issue, #478: URL: https://github.com/apache/datasketches-java/issues/478
Recently, we found a performance issue while using DataSketches through Apache Druid. Aggregations that merge HllSketches to get a final count were slow. Flame graphs showed that a lot of the time was spent checking whether the sketches are empty. That particular data set contained a large number of empty sketches, and each one was deserialized from a byte array just so that isEmpty() could be called on it.

This could be avoided, since merging an empty sketch is a no-op. A way to check for emptiness without first deserializing the whole sketch might help here.

We added [a custom check](https://github.com/apache/druid/pull/15162/files#diff-4211c731d7f7118cd3449bba2253870b97a1a83ed2feac8521a3d1246aa4c53cR87) that inspects the byte array without deserializing it: it checks the isEmpty flag in the header for the sketch implementation, and the number-of-elements byte for the set and list implementations. With only this change, we saw a performance improvement (11 seconds down to 9.7 seconds when merging 10M empty sketches).

Is there scope for a check like this to be added to the DataSketches library?
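To make the request concrete, here is a minimal sketch of the kind of header-only check described above. It is not the code from the linked Druid change and not an existing DataSketches API. The class name `HllEmptyCheck`, the method `looksEmpty`, and all byte offsets and masks (flags at byte 5, list count at byte 6, mode in the low two bits of byte 7, a little-endian set count at bytes 8 to 11, empty flag mask 4) are assumptions made for illustration and would need to be verified against the library's actual preamble layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class HllEmptyCheck {
    // ASSUMED header layout, for illustration only; verify against the library.
    static final int FLAGS_BYTE = 5;          // assumed position of the flags byte
    static final int LIST_COUNT_BYTE = 6;     // assumed element count in LIST mode
    static final int MODE_BYTE = 7;           // assumed: low 2 bits hold the current mode
    static final int HASH_SET_COUNT_INT = 8;  // assumed 4-byte element count in SET mode
    static final int EMPTY_FLAG_MASK = 4;     // assumed bit for the "empty" flag
    static final int CUR_MODE_MASK = 3;
    static final int MODE_LIST = 0;
    static final int MODE_SET = 1;
    static final int MODE_HLL = 2;

    private HllEmptyCheck() {}

    /**
     * Returns true only when the header alone indicates an empty sketch.
     * Returns false whenever that cannot be established, so the caller can
     * fall back to the normal deserialize-then-isEmpty() path.
     */
    public static boolean looksEmpty(byte[] bytes) {
        if (bytes == null || bytes.length < 8) {
            return false; // too short to contain the assumed header
        }
        int mode = bytes[MODE_BYTE] & CUR_MODE_MASK;
        switch (mode) {
            case MODE_LIST:
                // List implementation: a single count byte.
                return (bytes[LIST_COUNT_BYTE] & 0xFF) == 0;
            case MODE_SET:
                // Set implementation: read the count without building a sketch.
                if (bytes.length < HASH_SET_COUNT_INT + Integer.BYTES) {
                    return false;
                }
                int count = ByteBuffer.wrap(bytes)
                        .order(ByteOrder.LITTLE_ENDIAN)
                        .getInt(HASH_SET_COUNT_INT);
                return count == 0;
            case MODE_HLL:
                // Sketch implementation: rely on the empty flag in the header.
                return (bytes[FLAGS_BYTE] & EMPTY_FLAG_MASK) != 0;
            default:
                return false; // unknown mode: do not guess
        }
    }
}
```

A caller in the merge path could skip any input for which `HllEmptyCheck.looksEmpty(bytes)` returns true and deserialize the rest as before. The check is deliberately conservative: a false result never claims the sketch is non-empty, only that the header was not enough to decide.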
