[I] Possible improvement for quick isEmpty check on sketches (datasketches-java)

via GitHub Wed, 22 Nov 2023 03:50:31 -0800


adarshsanjeev opened a new issue, #478:
URL: https://github.com/apache/datasketches-java/issues/478

Recently, we found a performance issue while using data sketches through
Apache Druid.

There was some slowness while running aggregations which merge HllSketches
to get a final count. Looking through some flame graphs, a lot of the time
seems to be spent checking if the sketches are empty.

In that particular data, it seems that there were a large number of empty
sketches. This resulted in each sketch being deserialized from a byte array
before calling an isEmpty() check. This is something that could be avoided
since merging an empty sketch is a no-op, and a way to check if this is the
case without first deserializing the sketch entirely might help here.

We added [a custom
check](https://github.com/apache/druid/pull/15162/files#diff-4211c731d7f7118cd3449bba2253870b97a1a83ed2feac8521a3d1246aa4c53cR87)
that tests whether a sketch is empty by reading the byte array directly,
without deserializing it. It checks the isEmpty flag in the header for the
sketch implementation, and the number-of-elements byte for the set and list
implementations. With only this change, we saw a performance improvement
(11 seconds down to 9.7 seconds when merging 10M empty sketches).
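
For illustration, a header-only check of this kind might look roughly like
the sketch below. It is not the code from the linked PR and not an existing
library API: the class name, the method name, the flags-byte offset and the
empty-flag mask are all assumptions that would need to be verified against
the library's actual preamble layout. It covers only the flag check, not the
element-count check for the set and list implementations.

```java
// A minimal, hypothetical sketch of a header-only emptiness check.
// Nothing here is taken from the datasketches-java API; the constants
// below are assumed values for illustration only.
final class QuickEmptyCheck {
    // ASSUMPTION: position of the flags byte within the serialized header.
    static final int FLAGS_BYTE_OFFSET = 5;
    // ASSUMPTION: bit within the flags byte that marks an empty sketch.
    static final int EMPTY_FLAG_MASK = 0x04;
    // ASSUMPTION: smallest header size we are willing to inspect.
    static final int MIN_HEADER_BYTES = 8;

    private QuickEmptyCheck() {}

    /**
     * Returns true only if the header says the sketch is empty.
     * Returns false when the array is missing or too short, so the caller
     * can fall back to full deserialization followed by isEmpty().
     */
    static boolean isDefinitelyEmpty(byte[] sketchBytes) {
        if (sketchBytes == null || sketchBytes.length < MIN_HEADER_BYTES) {
            return false;
        }
        return (sketchBytes[FLAGS_BYTE_OFFSET] & EMPTY_FLAG_MASK) != 0;
    }
}
```

A merge loop could call `QuickEmptyCheck.isDefinitelyEmpty(bytes)` first and
skip the sketch when it returns true, deserializing only in the remaining
cases.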

Is there some scope for a check like this to be added to the data sketches
library?
