swaminathanmanish commented on PR #12042:
URL: https://github.com/apache/pinot/pull/12042#issuecomment-1839183839

   > > Thanks for the detailed description and the reasoning behind this 
approach. Intuitively this approach makes sense. I took a first pass of the PR 
and have some high-level questions.
   > > 
   > > 1. There are 4 new params introduced. Have we quantified the gains for 
each of these params and which one yields the largest gains? I'm assuming these 
params work independently of each other.
   > > 2. We have a nominal entries param for a sketch (which is the number of 
entries in a sketch?). Curious if we have already experimented with tuning this 
param to figure out the gains? How can this parameter impact performance?
   > > 3. Could you share the benchmark results/numbers for different values of 
the params?
   > 
   > Thank you for your review @swaminathanmanish. I found your questions very 
insightful. I should make it clear at the outset that it has been difficult to 
thoroughly benchmark the work in this pull request in a test environment. Some 
of the performance improvements are tunable knobs, and how they will behave in 
practice is still speculative.
   > 
   > Therefore I would like to propose that the parameters are retained and 
tested in a production environment, and then removed should they show no user 
benefit. The downside to this approach is that backward compatibility would be 
broken if end users had come to depend on them by then. As for your questions, 
answers follow inline.
   > 
   > > 1. There are 4 new params introduced. Have we quantified the gains for 
each of these params and which one yields the largest gains? I'm assuming these 
params work independently of each other.
   > 
   > The largest gain can be realised by adjusting the sampling probability 
parameter. However, this is highly dependent on the use case and should only be 
used in certain circumstances. All parameters are independent of each other, 
and their defaults retain the existing behaviour of the system. The gains 
measured in testing are between 25% and 50%. Where the results are sampled, the 
speed increases by 300%.
   > 
   > > 2. We have a nominal entries param for a sketch (which is the number of 
entries in a sketch?). Curious if we have already experimented with tuning this 
param to figure out the gains? How can this parameter impact performance?
   > 
   > We have been using this extensively, but it does not help where there is a 
large tail of sketches that have fewer retained items than nominal entries. For 
these cases, selecting lower nominal entries can impact the error across the 
board. Instead, using sampling probability allows the end user to curtail size 
and increase error on the long tail, which, for some reports and queries, is a 
good tradeoff.
   > 
   > > 3. Could you share the benchmark results/numbers for different values of 
the params?
   > 
   > I'd like to do more testing in a production environment. However, the 
default set of parameters can be up to 25% faster than the current 
implementation (through sorting and retaining unions). Adjusting others, such 
as sampling, has shown orders-of-magnitude gains, but there are accuracy 
tradeoffs that need to be considered.
   
   Thanks for the detailed response @davecromberge 
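
   For anyone following the nominal entries versus sampling probability
   discussion quoted above, here is a rough toy model of the tradeoff. It is
   only a sketch under stated assumptions: it is not the Apache DataSketches
   implementation or the code in this PR. `ToyThetaSketch`, `unit_hash` and
   `build` are hypothetical names, SHA-256 stands in for the real hash
   function, and the sizes (200 and 20,000 distinct items, 512 and 64 nominal
   entries, p = 0.25) are made up for illustration.

```python
import hashlib


def unit_hash(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) (SHA-256 is a stand-in)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyThetaSketch:
    """Toy KMV-style sketch: keeps at most k hashes, all of them below theta."""

    def __init__(self, nominal_entries: int, sampling_probability: float = 1.0):
        self.k = nominal_entries
        # Assumption: up-front sampling is modelled by starting theta at p.
        self.theta = sampling_probability
        self.hashes = set()

    def update(self, item: str) -> None:
        h = unit_hash(item)
        if h >= self.theta:
            return  # outside the sampled region, so the item is not retained
        self.hashes.add(h)
        if len(self.hashes) > self.k:
            # Over capacity: evict the largest hash and lower theta to it.
            largest = max(self.hashes)
            self.hashes.remove(largest)
            self.theta = largest

    def retained(self) -> int:
        return len(self.hashes)

    def estimate(self) -> float:
        # Retained count scaled up by the fraction of hash space kept.
        return len(self.hashes) / self.theta


def build(n_distinct: int, nominal_entries: int, sampling_probability: float = 1.0):
    """Feed n_distinct synthetic items into a fresh toy sketch."""
    sketch = ToyThetaSketch(nominal_entries, sampling_probability)
    for i in range(n_distinct):
        sketch.update(f"item-{i}")
    return sketch


# Compare a long-tail sketch (fewer items than nominal entries) with a large one.
for label, n in (("long-tail", 200), ("large", 20_000)):
    for k, p in ((512, 1.0), (64, 1.0), (512, 0.25)):
        s = build(n, k, p)
        print(f"{label}: n={n} k={k} p={p} "
              f"retained={s.retained()} estimate={s.estimate():.0f}")
```

   In this toy model, sampling shrinks only the sketches that never reach
   nominal entries, and their estimates become approximate, while the large
   sketch ends up identical to the unsampled one. Lowering nominal entries
   instead shrinks every sketch that exceeds the new limit.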


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

