davecromberge commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1822891037
Notes for reviewers:
===============
The changes in this PR introduce a different intermediate result type for
the DistinctCountThetaSketch aggregation function. The reason for doing so is
to eliminate the intermediate bookkeeping necessary to perform unions on two
sketches at a time.
Moreover, the "early-stop" optimisation in set operation circumvents
further processing when retained items fall above the minimum theta value.
This applies to other post set operation expressions as well.
When using JMH benchmarks to simulate this scenario, the speedup achieved by
accumulating sketches prior to union is often an improvement by a factor of 3.
However, I have not been able to observe this speedup in actual test queries
due to limitations in my test environment. Instead, this will become more
important when sketches require merging from thousands of segments on a single
server.
Alternatives considered
------------------------
1. Only introducing additional parameters - this avoids the complexity in
accumulating intermediate sketches and still gives the end user the option to
use p-sampling.
2. Using Union as the intermediate type. The Datasketches library exposes
the ability to serialize a union as bytes but does not expose the capability to
use `fastWrap` which deserialises the union back into an object on the heap.
This would be simpler than the approach in this PR and if desired I can work
with the Datasketches team to add this capability.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]