swaminathanmanish commented on PR #12042: URL: https://github.com/apache/pinot/pull/12042#issuecomment-1839183839
> > Thanks for the detailed description and the reasoning behind this approach. Intuitively this approach makes sense. I took a first pass of the PR and have some high-level questions.
> >
> > 1. There are 4 new params introduced. Have we quantified the gains for each of these params, and which one yields the largest gains? I'm assuming these params work independently of each other.
> > 2. We have a nominal entries param for a sketch (which is the number of entries in a sketch?). Curious if we have already experimented with tuning this param to figure out the gains? How can this parameter impact performance?
> > 3. Could you share the benchmark results/numbers for different values of the params?
>
> Thank you for your review @swaminathanmanish. I found your questions very insightful. I should make it clear at the outset that it has been difficult to thoroughly benchmark the work in this pull request in a test environment. Some of the performance improvements are knobs to turn, and how they will behave is speculative.
>
> Therefore I would like to propose that the parameters are retained and tested in a production environment, and then subsequently removed should they show no user benefit. The downside to this approach is that backward compatibility will not be maintained should end users start to depend on them. As for your questions, answers follow inline.
>
> > 1. There are 4 new params introduced. Have we quantified the gains for each of these params, and which one yields the largest gains? I'm assuming these params work independently of each other.
>
> The largest gain can be realised by adjusting the sampling probability parameter. However, this is highly dependent on the use case and should only be used in certain circumstances. All parameters are independent of each other, and their defaults retain the existing behaviour of the system.
> The performance gains measured in testing are between 25% and 50%. Where the results are sampled, the speed increases by 300%.
>
> > 2. We have a nominal entries param for a sketch (which is the number of entries in a sketch?). Curious if we have already experimented with tuning this param to figure out the gains? How can this parameter impact performance?
>
> We have been using this extensively, but it does not help where there is a large tail of sketches whose retained items number fewer than nominal entries. For these cases, selecting lower nominal entries can impact the error across the board. Instead, using sampling probability allows the end user to curtail size and increase error on the long tail, which, for some reports and queries, is a good tradeoff.
>
> > 3. Could you share the benchmark results/numbers for different values of the params?
>
> I'd like to do more testing in a production environment. However, the default set of parameters can be up to 25% faster than the current implementation (through sorting and retaining unions). Adjusting others, such as sampling, has shown orders-of-magnitude gains, but there are certain tradeoffs on accuracy that need to be considered.

Thanks for the detailed response @davecromberge

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
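
The tradeoff described in the quoted reply, between lowering nominal entries and lowering sampling probability, can be illustrated with a toy model. The sketch below is a simplified KMV-style structure written for this note only. It is not the Pinot or DataSketches implementation, and the names `ToyThetaSketch`, `hash01`, and `build` are hypothetical. It assumes that sampling probability acts as the initial value of the threshold theta, and that nominal entries caps the number of retained hashes.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyThetaSketch:
    """Toy KMV-style sketch (illustrative assumption, not the real library)."""

    def __init__(self, nominal_entries: int, sampling_probability: float = 1.0):
        self.k = nominal_entries
        self.theta = sampling_probability  # hashes >= theta are rejected up front
        self._heap = []     # max-heap of retained hashes, stored negated
        self._seen = set()  # retained hashes, used for de-duplication

    def update(self, item: str) -> None:
        h = hash01(item)
        if h >= self.theta or h in self._seen:
            return
        heapq.heappush(self._heap, -h)
        self._seen.add(h)
        if len(self._heap) > self.k:
            # Evict the largest retained hash; it becomes the new threshold.
            largest = -heapq.heappop(self._heap)
            self._seen.discard(largest)
            self.theta = largest

    @property
    def retained(self) -> int:
        return len(self._heap)

    def estimate(self) -> float:
        # Each retained hash stands in for 1/theta distinct items.
        return self.retained / self.theta


def build(prefix: str, n: int, nominal_entries: int, p: float = 1.0) -> ToyThetaSketch:
    """Build a sketch over n distinct synthetic items."""
    sketch = ToyThetaSketch(nominal_entries, p)
    for i in range(n):
        sketch.update(f"{prefix}-{i}")
    return sketch


if __name__ == "__main__":
    # A "long tail": 50 small sets, each far below nominal entries.
    for p in (1.0, 0.1):
        tail = [build(f"s{j}", 500, 4096, p) for j in range(50)]
        print(
            f"p={p}: retained={sum(s.retained for s in tail)}, "
            f"estimated total={sum(s.estimate() for s in tail):.0f}"
        )
```

In this model, a set smaller than nominal entries is stored exactly when the sampling probability is 1.0, so lowering nominal entries does not shrink it until the cap falls below the set size. Lowering the sampling probability shrinks every small sketch by roughly that factor, at the cost of estimation error on those sets.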
