LakshSingla commented on pull request #12139:
URL: https://github.com/apache/druid/pull/12139#issuecomment-1010007702


   Thanks for the review!
   
   > this seems pretty useful, but also looks rather expensive since this is 
going to happen for every row. Could you measure the performance before and 
after this change? this benchmark might be a good place to start
   
   Yes, this would add some performance overhead to queries while they are
running. That said, I noticed no difference when running
`CachingClusteredClientBenchmark` on this branch versus the master branch on my
laptop (31min16s vs 31min respectively). Is there any other benchmark that is
specific to the `ClientQuerySegmentWalker`? Let me take a look around as well.
   
   > Also, there appears to be no way to disable it, maybe it should be 
possible to set the limit to 0 to disable this computation instead of setting 
the limit to max long value?
   
   Nice idea! Will incorporate this. 
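   A minimal sketch of how the limit-of-0-means-disabled convention could look
(class and method names here are hypothetical, purely for illustration, and not
the PR's actual code):

   ```java
   // Hypothetical illustration: a limit of 0 disables per-row size
   // accounting entirely, instead of forcing callers to pass Long.MAX_VALUE.
   public class RowSizeLimit {
       private final long maxBytes;

       public RowSizeLimit(long maxBytes) {
           this.maxBytes = maxBytes;
       }

       // Zero (or negative) means "do not compute row sizes at all".
       public boolean enabled() {
           return maxBytes > 0;
       }

       // Only enforce the limit when the feature is enabled.
       public boolean exceeded(long accumulatedBytes) {
           return enabled() && accumulatedBytes > maxBytes;
       }
   }
   ```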
   
   > Should it prove expensive, maybe the approach should be to just sample the 
first 'n' rows and use whatever the average estimated size is for any remaining 
rows instead of trying to estimate every row encountered. I imagine the loss of 
accuracy would be worth how much cheaper it would be to not have to loop every 
column of every row for all rows.
   
   This seems like a good middle ground between performance and accuracy. But
how would this fare for strings, or for array-typed columns? Instead of taking
the average, should the min/max of the first n samples be used to mark a lower
bound/upper bound on the memory usage? Also, is sampling only the first 'n'
rows a safe assumption about the size of the rest, or could we do better by
skipping ahead after a while and picking rows at random?
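   For reference, the sample-then-average idea could be sketched roughly like
this (illustrative only; the class, its names, and the exact-sizer callback are
assumptions, not the PR's implementation): measure the first n rows exactly,
then charge every later row the observed average.

   ```java
   import java.util.function.ToLongFunction;

   // Illustrative sampling estimator: size the first `sampleSize` rows
   // exactly, then apply the running average to every subsequent row.
   // All names are hypothetical.
   public class SamplingSizeEstimator<T> {
       private final int sampleSize;
       private final ToLongFunction<T> exactSizer;
       private long sampledRows = 0;
       private long sampledBytes = 0;

       public SamplingSizeEstimator(int sampleSize, ToLongFunction<T> exactSizer) {
           this.sampleSize = sampleSize;
           this.exactSizer = exactSizer;
       }

       public long estimate(T row) {
           if (sampledRows < sampleSize) {
               // Still in the sampling window: do the full per-column walk.
               long bytes = exactSizer.applyAsLong(row);
               sampledRows++;
               sampledBytes += bytes;
               return bytes;
           }
           // Past the window: charge the average of what was measured.
           return sampledBytes / sampledRows;
       }
   }
   ```

   The open question above still applies: a fixed average can badly
underestimate variable-width columns such as strings or arrays, which is why a
min/max bound or periodic re-sampling might be worth considering.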


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


