LakshSingla commented on pull request #12139: URL: https://github.com/apache/druid/pull/12139#issuecomment-1010007702
Thanks for the review!

> this seems pretty useful, but also looks rather expensive since this is going to happen for every row. Could you measure the performance before and after this change? this benchmark might be a good place to start

Yes, this would add a performance overhead to queries while they are running. That said, I noticed no difference running `CachingClusteredClientBenchmark` on this branch versus the master branch on my laptop (31 min 16 s vs. 31 min respectively). Is there another benchmark specific to `ClientQuerySegmentWalker`? Let me take a look around as well.

> Also, there appears to be no way to disable it, maybe it should be possible to set the limit to 0 to disable this computation instead of setting the limit to max long value?

Nice idea! Will incorporate this.

> Should it prove expensive, maybe the approach should be to just sample the first 'n' rows and use whatever the average estimated size is for any remaining rows instead of trying to estimate every row encountered. I imagine the loss of accuracy would be worth how much cheaper it would be to not have to loop every column of every row for all rows.

This seems like a good middle ground between performance and accuracy. But how would this fare for Strings, or for columns of array types? Instead of taking the average, should the min/max of the first 'n' samples be used to mark a lower/upper bound on the memory usage? Also, would sampling the first 'n' rows be a representative assumption about the size of the rest, or could we do better by skipping after a while and picking rows randomly?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
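For context, the sampling approach under discussion could look roughly like the sketch below. This is a hypothetical illustration, not Druid's actual API: the class name, the `Object[]` row representation, and the per-column byte costs are all assumptions made up for the example. It measures every column of the first `n` rows exactly, then amortizes by returning the running average for all later rows.

```java
// Hypothetical sketch of sampling-based row-size estimation (not Druid code).
// Measures the first `sampleSize` rows column-by-column, then reuses the
// average so later rows skip the per-column walk entirely.
class SamplingRowSizeEstimator
{
  private final int sampleSize; // number of rows to measure exactly
  private long sampledBytes = 0;
  private int sampledRows = 0;

  SamplingRowSizeEstimator(int sampleSize)
  {
    this.sampleSize = sampleSize;
  }

  /** Returns the estimated in-memory size of this row, in bytes. */
  long estimate(Object[] row)
  {
    if (sampledRows < sampleSize) {
      long measured = measure(row); // exact walk over every column
      sampledBytes += measured;
      sampledRows++;
      return measured;
    }
    // Past the sample window: constant-time amortized average.
    return sampledBytes / sampledRows;
  }

  // Crude per-column measurement with made-up byte costs; a real
  // implementation would handle arrays, complex types, nulls, etc.
  private long measure(Object[] row)
  {
    long total = 0;
    for (Object col : row) {
      if (col instanceof String) {
        total += 28 + 2L * ((String) col).length(); // rough String footprint
      } else {
        total += 16; // rough boxed-primitive footprint
      }
    }
    return total;
  }
}
```

Note this is exactly where the accuracy question bites: a plain average over the first 'n' rows can badly misestimate variable-width columns (Strings, arrays), which is why tracking min/max bounds or sampling randomly rather than just the prefix might be worth considering.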
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
