leerho commented on PR #554:
URL: 
https://github.com/apache/datasketches-java/pull/554#issuecomment-2084138258

   Hi Zac, 
   Thank you very much for your suggestion! It really helps us understand the 
issues you face in using our sketches and how we can improve the library!
   
   The first concern that jumps out at me is that your specific problem and the 
solution you propose is focused on items as Strings. Although I recognize that 
Strings may be a dominant use case for the items sketches, it is not the only 
application, as items sketches can be used for arbitrary objects.  So I am 
hesitant to put into the sketch a lot of special code that is only useful for 
Strings. 
   
   That being said, let me ask you some questions and explore some other ideas 
with you. In your write-up above you said:
   
   > ...the system needs to track roughly how much memory an aggregation is 
using.
   
   - This gives me hope!  Nonetheless, how "rough" is this need?  +/- 1%, 10%, 
50% ? Because I think we can come up with a good approximation of the memory 
usage that may be much simpler than wedging a bunch of code for a specific use 
case into the sketch itself.
   
   - You gave some numbers above.  Allow me to noodle with them (although I'm 
sure you did not want me to do this! )
       - Total data: 100TB = 1e14 bytes
       - Rows: 1T = 1e12 => ~ 100 bytes / row
       - String items per row ? I assume 1 ?
       - Groups: 100M = 1e8 groups => ~ 1e4 rows / group => 1e6 bytes / group
   
   - Do these extrapolations sound about right?  If you are not sure, would it 
be possible for you to characterize your string length, row size, and group 
size distributions of your raw data?  This would be just three 
KllFloatsSketches, one each for capturing the string lengths, the row sizes and 
the group sizes.   Easier said than done, I'm sure, but this will be very 
useful, as I'll explain.
   
   But first:
   
   - What value of K are you using for your sketches now? And what is your 
requirement for rank accuracy that led you to choose this value of K?
   - What are you using the sketches for?  
       - Data partitioning perhaps?  If so, we have a whole lot more to discuss!
       - Anomaly detection?
       - How does knowing the data distribution per group (per day, per 
whatever?) help you?
   - I assume, although you didn't say exactly, that having an estimate of how 
much memory your aggregation functions are using is important for memory 
planning and partitioning, and more important than the fact that they are using 
memory. Is that correct?
   
   --IDEA--
   Sketches are statistical animals, so lets try to take advantage of 
statistics of your data to get approximate, yet useful estimations of the data 
you need for your memory planning.
   
   The KLL sketches have a very well defined growth plan that looks something 
like the following. The red curve is for KllDoublesSketch and the blue curve is 
for KllFloatsSketch.
   
   ![GrowthFloats 
Doubles_K200](https://github.com/apache/datasketches-java/assets/12941506/eb5c0a4a-1aec-45f8-9d7e-313dbe58b894)
   
   Normalizing these two curves by the size of a double (8 bytes) and a float 
(4 bytes) produces this curve:
   
   <img width="646" alt="NormalizedGrowth_K200b" 
src="https://github.com/apache/datasketches-java/assets/12941506/8bd84d4e-3c5f-49fa-8326-7a3912bce20c";>
   
   The message from these graphs is that once KLL sketches reach a certain size 
they grow very, very slowly. As you can see, from the red KllDoublesSketch 
curve, from 1M input doubles to 1B input doubles, the sketch only grows from 
5KB to about 5.7KB or about 700 bytes.  And going from 1B input doubles to 1T 
input doubles will cause the size to increase by  another 700 bytes!  This is 
because that final slope remains constant.
   
   Now with the second normalized curve above, and a distribution of item sizes 
extracted from your actual data, we can create a statistical model that pretty 
closely estimates the actual memory usage of a KllItemsSketch being fed the 
same data.  This approach could be used no matter what the item is, and it can 
be done while the KllItemsSketch is being feed data.  This might mean having 
two sketches side by side: A floats sketch tracking the distribution of item 
sizes, and the actual items sketch tracking the distribution of the items 
themselves.
   
   Would you be willing to try something like this?
   
   Cheers,
   Lee.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to