richardstartin edited a comment on pull request #7828:
URL: https://github.com/apache/pinot/pull/7828#issuecomment-978891304


   >  I actually considered that, and decided to go with the snapshot approach 
because we want to optimize the query runtime method as much as possible
   
   I don’t think this is likely to be an optimisation unless the removeAll call 
is made frequently and the number of empty segments is large.
   
   I think when an optimisation makes the code much harder to read, we need 
four pieces of evidence:
   
   1. **evidence that there is a problem** - e.g. wall time thresholded tracing 
events around EmptySegmentPruner.prune, or a sampling profiler reporting lots 
of samples in this method
   2. **evidence that the thing being optimised solves the problem** - assuming 
we have evidence of a problem, is the cause the removeAll call (which a plain 
HashSet may speed up relative to a concurrent variant) or the copy on line 151?
   3. **evidence the problem is common** - e.g. statistics about number of 
empty segments from a real cluster, do we have a metric for this?
   4. **evidence the optimisation is effective** - for the set sizes observed 
in practice, does a plain HashSet actually outperform a concurrent set?
   
   I raise this only because the code is *much* harder to read than it would be 
with a thread-safe Set. If it were equally readable, I would be inclined to 
assume the optimisation is effective.
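
   To make the readability trade-off concrete, here is a minimal sketch. The 
class and field names are hypothetical, not Pinot's actual EmptySegmentPruner 
code; it only contrasts the two shapes being discussed - a thread-safe set 
versus a copy-on-write snapshot that lets removeAll run on a plain HashSet:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch -- names and structure are illustrative only,
// not Pinot's actual EmptySegmentPruner implementation.
public class EmptySegmentTracking {

  // Option A: a thread-safe set. The prune logic is a readable one-liner.
  static final Set<String> concurrentEmptySegments = ConcurrentHashMap.newKeySet();

  static void pruneConcurrent(Set<String> removed) {
    concurrentEmptySegments.removeAll(removed);
  }

  // Option B: copy-on-write snapshot. removeAll runs against a plain
  // non-concurrent HashSet, but every update must copy and republish
  // the whole set -- the extra machinery the comment above is questioning.
  static volatile Set<String> snapshotEmptySegments = new HashSet<>();

  static synchronized void pruneSnapshot(Set<String> removed) {
    Set<String> copy = new HashSet<>(snapshotEmptySegments); // the copy step
    copy.removeAll(removed);
    snapshotEmptySegments = copy; // publish via volatile write
  }

  public static void main(String[] args) {
    concurrentEmptySegments.add("seg1");
    concurrentEmptySegments.add("seg2");
    pruneConcurrent(Set.of("seg1"));
    System.out.println(concurrentEmptySegments.contains("seg1")); // false
    System.out.println(concurrentEmptySegments.contains("seg2")); // true

    pruneSnapshot(Set.of("segX")); // removal from an empty snapshot is a no-op
    System.out.println(snapshotEmptySegments.isEmpty()); // true
  }
}
```

   Whether option B's extra copying pays for itself is exactly what points 1-4 
above ask to be measured for realistic set sizes.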


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


