davecromberge commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1822891037

   Notes for reviewers:
   ===============
   
   The changes in this PR introduce a different intermediate result type for 
the DistinctCountThetaSketch aggregation function.  The reason for doing so is 
to eliminate the intermediate bookkeeping necessary to perform unions on two 
sketches at a time.   
    Moreover, the "early-stop" optimisation in set operation circumvents 
further processing when retained items fall above the minimum theta value.  
This applies to other post set operation expressions as well.
   
   When using JMH benchmarks to simulate this scenario, the speedup achieved by 
accumulating sketches prior to union is often an improvement by a factor of 3.  
   However, I have not been able to observe this speedup in actual test queries 
due to limitations in my test environment.  Instead, this will become more 
important when sketches require merging from thousands of segments on a single 
server.
   
   Alternatives considered
   ------------------------
   
   1.  Only introducing additional parameters - this avoids the complexity in 
accumulating intermediate sketches and still gives the end user the option to 
use p-sampling.
   
   2. Using Union as the intermediate type.  The Datasketches library exposes 
the ability to serialize a union as bytes but does not expose the capability to 
use `fastWrap` which deserialises the union back into an object on the heap.  
This would be simpler than the approach in this PR and if desired I can work 
with the Datasketches team to add this capability.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to