gianm commented on issue #12261:
URL: https://github.com/apache/druid/issues/12261#issuecomment-1040781523


   > PLEASE, engage with us and help us understand what it is you are trying to 
do. If it means adding a public API to accomplish it, we are more than willing 
to do that.
   
   Please don't take it personally that nobody reached out to you about the 
original problem. The Druid project is a big place with a lot of dependencies, 
and not all contributors are aware of how willing those various dependencies' 
maintainers are to work with us. Your willingness to work with us over the 
years is very much appreciated, as is the fact that you took the time to write 
up your concerns about this particular case.
   
   > The memory currently being used by a Union or Sketch is rather complex.
   
   The commit is from #12073 and #12022, which aim to improve memory 
estimation in the on-heap indexing implementation. The goal of the patch is to 
decide when to spill an on-heap index to disk. To do this properly, we need a 
way to estimate the memory usage of the index. The patch works by adding an 
`aggregateWithSize()` method to on-heap aggregators that (1) does an 
aggregation and (2) returns the amount of additional memory use caused by that 
particular aggregation. Each aggregator is associated with a particular row in 
the index, so we then sum up the memory used by all aggregators in all rows to 
get the memory footprint of the entire index.
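   To make the pattern concrete, here is an illustrative sketch of what an `aggregateWithSize()`-style contract looks like. This is not the actual Druid interface or the SketchAggregator code; the class names, the 48-bytes-per-entry constant, and the toy distinct-count logic are all assumptions for illustration only.

   ```java
   import java.util.HashSet;
   import java.util.Set;

   // Toy illustration of the aggregateWithSize() pattern: each call performs
   // the aggregation AND returns the additional heap bytes that call caused,
   // so the index can sum the deltas across all aggregators in all rows.
   public class AggregateWithSizeSketch
   {
     // Hypothetical "distinct count" aggregator; 48 bytes per new set entry
     // is an assumed placeholder, not a measured figure.
     static class DistinctAggregator
     {
       private static final long BYTES_PER_ENTRY = 48;
       private final Set<Long> values = new HashSet<>();

       long aggregateWithSize(long value)
       {
         // Only a newly retained value grows the footprint; duplicates cost 0.
         return values.add(value) ? BYTES_PER_ENTRY : 0;
       }
     }

     public static void main(String[] args)
     {
       DistinctAggregator agg = new DistinctAggregator();
       long total = 0;
       for (long v : new long[]{1, 2, 2, 3}) {
         total += agg.aggregateWithSize(v);
       }
       // 3 distinct values were retained, so the summed delta is 3 * 48.
       System.out.println(total);
     }
   }
   ```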
   
   The private field in this case is being used by SketchAggregator to 
implement the memory-estimation piece of `aggregateWithSize()`.
   
   We shouldn't be using private fields like that, so we should find another 
way to solve the original problem.
   
   One obvious solution I can think of is to use a simple formula, like 
something that is based only on the `size` parameter and the number of items 
added so far. I'm not sure how accurate this would be since I'm not super 
familiar with how the sketches internally manage their on-heap memory. But 
maybe it'd be close enough? If we do go this way, an implementation note: at 
ingestion time, it's likely that a small number of items get added to each 
sketch. (Like in the 1 to 50 range.) I think some sketches have a special mode 
they use when the number of added items is low, and whenever that is true, we'd 
want to take that into account to avoid overcounting memory usage.
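   A rough shape for that formula-based approach might look like the sketch below. All constants here (per-item bytes, base overhead, the small-mode threshold) are placeholder assumptions, not values taken from the DataSketches library; a real implementation would need to be calibrated against the sketches' actual on-heap behavior.

   ```java
   // Hypothetical formula-based footprint estimate driven only by the
   // configured `size` parameter and the number of items added so far,
   // with a special case for the low-item-count mode mentioned above.
   public class SketchFootprintEstimate
   {
     private static final long BASE_OVERHEAD_BYTES = 64;   // assumed
     private static final long BYTES_PER_ITEM = 8;         // assumed
     private static final int SMALL_MODE_THRESHOLD = 16;   // assumed

     static long estimateBytes(int size, long itemsAdded)
     {
       if (itemsAdded <= SMALL_MODE_THRESHOLD) {
         // Small-count mode: assume the sketch only allocates for the items
         // seen so far, to avoid overcounting at ingestion time where each
         // sketch typically receives few items.
         return BASE_OVERHEAD_BYTES + itemsAdded * BYTES_PER_ITEM;
       }
       // Otherwise assume growth is capped by the configured size.
       long retained = Math.min(itemsAdded, size);
       return BASE_OVERHEAD_BYTES + retained * BYTES_PER_ITEM;
     }
   }
   ```

   The key design point is that the estimate never exceeds the cost implied by the configured `size`, while staying proportional to item count in the small regime.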
   
   @AlexanderSaydakov or @leerho, do you have any other, better suggestions? 
And @kfaraz, could you please work with them to implement a better solution 
once we come up with one?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

