leerho commented on PR #554:
URL:
https://github.com/apache/datasketches-java/pull/554#issuecomment-2087730974
Zac, Thank you! This is very helpful.
I am not discarding your suggested improvements to the ArrayOfStringsSerDe.
The first one is especially important and we will definitely integrate that in
the next release.
The second one is more a demonstration of where the problem is. Of course
you are free to extend these ArrayOf*SerDe classes to suit your needs.
Substituting the char[].length for the UTF-8 byte[].length is really only safe
in English-speaking locales where ASCII predominates; there are lots of
situations where this assumption will not hold.
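To make that pitfall concrete, here is a small, hypothetical plain-JDK illustration (not part of the library) of how the char count and the UTF-8 byte count diverge as soon as the text is not pure ASCII:

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    public static void main(String[] args) {
        String[] samples = { "hello", "naïve", "日本語" };
        for (String s : samples) {
            int chars = s.length();                                    // UTF-16 code units
            int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length; // actual serialized bytes
            System.out.printf("%s: %d chars, %d UTF-8 bytes%n", s, chars, utf8Bytes);
        }
        // "hello": 5 chars, 5 bytes; "naïve": 5 chars, 6 bytes; "日本語": 3 chars, 9 bytes.
        // A SerDe that records char[].length as the serialized byte length will
        // mis-size its buffer for anything outside ASCII.
    }
}
```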
As to your suggestions in the PR, I will have to study them more closely.
Some I think are OK, others I will have to think about.
****
Back to your description of your use-case. Allow me to try to play back to
you what I am gathering from your description:
- Start with a very large table with ~1T (1e12) rows, each row can have
several hundred columns.
- "With a string-size of 64 bytes and 1T rows, you're looking at 64T of
data. Across 100M groups that's roughly 640k of data per group and 100K items
per sketch in the group."
- Doesn't this assume you are analyzing only one column?
- Nonetheless, the data per group will vary widely and is likely power-law
distributed (a few groups with millions of rows, and millions of groups with
only a few rows).
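Just to check my reading of those numbers, a back-of-the-envelope sketch using the stated assumptions (1e12 rows, 1e8 groups, 64 bytes per item, one column only):

```java
public class GroupSizing {
    public static void main(String[] args) {
        double rows = 1e12;          // total rows in the table
        double groups = 1e8;         // number of groups
        double bytesPerItem = 64.0;  // average string size in bytes

        double itemsPerGroup = rows / groups;                 // average items per group
        double bytesPerGroup = itemsPerGroup * bytesPerItem;  // average data per group
        double totalBytes = rows * bytesPerItem;              // one column over all rows

        System.out.printf("items/group: %.0f%n", itemsPerGroup);   // 10,000
        System.out.printf("bytes/group: %.0f (~640 KB)%n", bytesPerGroup);
        System.out.printf("total bytes: %.0f (~64 TB)%n", totalBytes);
    }
}
```

Note that these averages come out to about 10K items per group for a single column; the actual per-group counts would of course vary widely, as noted above.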
It is my understanding that it is at this group level where you want to
analyze the distribution of items, where you might have a KLL sketch per
column, thus many sketches just for one group. And, of course, all 100M
groups would be configured with sketches as well, implying a huge number of
sketches for the entire query analysis -- thus the concern about memory usage.
Assuming that the above is roughly right, here is where I am a little
confused (I am not a DB engine expert by any means!). Above you mentioned:
> The main objective of using these sketches is to estimate the result of
predicates (=, !=, <, <=, >, >=, etc.) applied to columns on a table, which
enables the optimizer's CBO to create better query plans.
- Can you give me an example of how you use the distribution information
from these sketches along with the user chosen predicates to help you with
query optimization?
- What are you looking for in the distributions from these sketches?
- What decisions can you (or your query planning model) make from these
distributions?
_Note: If you are looking for Heavy-Hitters the Frequent Items sketch might
be very useful._
_Note: If you are converting these sketches into PMF histograms, I should
caution you against using too many split-points for the given K of the
sketch, because you can easily exceed the error threshold of the sketch
(producing garbage). Our most recent release, 6.0.0, has some extra functions
to help you with this._
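As a rough illustration of that caution: the ~1.65% figure below is the commonly cited approximate normalized rank error for KLL at the default K = 200, and the 1/(2ε) cutoff is only my rule of thumb for this comment, not a library guarantee:

```java
public class SplitPointBudget {
    public static void main(String[] args) {
        double eps = 0.0165; // approx. normalized rank error for KLL at K = 200

        // If adjacent split-points are closer together in rank than about 2*eps,
        // each PMF bin's mass becomes comparable to the sketch's error -- i.e. noise.
        int maxUsefulSplits = (int) Math.floor(1.0 / (2.0 * eps));
        System.out.println("rank error ~" + eps);
        System.out.println("rough max useful split-points: " + maxUsefulSplits); // ~30
    }
}
```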
*****
Given your estimate of an average of 64B per item, this is what the KLL
Growth Path might look like:
<img width="475" alt="NormalizedGrowth_64B_K200"
src="https://github.com/apache/datasketches-java/assets/12941506/1251b778-a877-4009-a30b-533424403e3e">
This is obviously much bigger than 5KB! It is just the previous graph, which
was normalized to one byte per item, multiplied by 64.
Cheers,
Lee.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]