leerho commented on PR #554:
URL:
https://github.com/apache/datasketches-java/pull/554#issuecomment-2084138258
Hi Zac,
Thank you very much for your suggestion! It really helps us understand the
issues you face in using our sketches and how we can improve the library!
The first concern that jumps out at me is that your specific problem and the
solution you propose is focused on items as Strings. Although I recognize that
Strings may be a dominant use case for the items sketches, it is not the only
application, as items sketches can be used for arbitrary objects. So I am
hesitant to put into the sketch a lot of special code that is only useful for
Strings.
That being said, let me ask you some questions and explore some other ideas
with you. In your write-up above you said:
> ...the system needs to track roughly how much memory an aggregation is
using.
- This gives me hope! Nonetheless, how "rough" is this need? +/- 1%, 10%,
50% ? Because I think we can come up with a good approximation of the memory
usage that may be much simpler than wedging a bunch of code for a specific use
case into the sketch itself.
- You gave some numbers above. Allow me to noodle with them (although I'm
sure you did not want me to do this! )
- Total data: 100TB = 1e14 bytes
- Rows: 1T = 1e12 => ~ 100 bytes / row
- String items per row ? I assume 1 ?
- Groups: 100M = 1e8 groups => ~ 1e4 rows / group => 1e6 bytes / group
- Do these extrapolations sound about right? If you are not sure, would it
be possible for you to characterize your string length, row size, and group
size distributions of your raw data? This would be just three
KllFloatsSketches, one each for capturing the string lengths, the row sizes and
the group sizes. Easier said than done, I'm sure, but this will be very
useful, as I'll explain.
But first:
- What value of K are you using for your sketches now? And what is your
requirement for rank accuracy that led you to choose this value of K?
- What are you using the sketches for?
- Data partitioning perhaps? If so, we have a whole lot more to discuss!
- Anomaly detection?
- How does knowing the data distribution per group (per day, per
whatever?) help you?
- I assume, although you didn't say exactly, that having an estimate of how
much memory your aggregation functions are using is important for memory
planning and partitioning, and more important than the fact that they are using
memory. Is that correct?
--IDEA--
Sketches are statistical animals, so lets try to take advantage of
statistics of your data to get approximate, yet useful estimations of the data
you need for your memory planning.
The KLL sketches have a very well defined growth plan that looks something
like the following. The red curve is for KllDoublesSketch and the blue curve is
for KllFloatsSketch.

Normalizing these two curves by the size of a double (8 bytes) and a float
(4 bytes) produces this curve:
<img width="646" alt="NormalizedGrowth_K200b"
src="https://github.com/apache/datasketches-java/assets/12941506/8bd84d4e-3c5f-49fa-8326-7a3912bce20c">
The message from these graphs is that once KLL sketches reach a certain size
they grow very, very slowly. As you can see, from the red KllDoublesSketch
curve, from 1M input doubles to 1B input doubles, the sketch only grows from
5KB to about 5.7KB or about 700 bytes. And going from 1B input doubles to 1T
input doubles will cause the size to increase by another 700 bytes! This is
because that final slope remains constant.
Now with the second normalized curve above, and a distribution of item sizes
extracted from your actual data, we can create a statistical model that pretty
closely estimates the actual memory usage of a KllItemsSketch being fed the
same data. This approach could be used no matter what the item is, and it can
be done while the KllItemsSketch is being feed data. This might mean having
two sketches side by side: A floats sketch tracking the distribution of item
sizes, and the actual items sketch tracking the distribution of the items
themselves.
Would you be willing to try something like this?
Cheers,
Lee.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]