leerho commented on PR #554:
URL: https://github.com/apache/datasketches-java/pull/554#issuecomment-2087730974

   Zac, Thank you!  This is very helpful.  
   
   I am not discarding your suggested improvements to the ArrayOfStringsSerDe.  
The first one is especially important and we will definitely integrate that in 
the next release.  
   
   The second one is more a demonstration of where the problem lies. Of course 
you are free to extend these ArrayOf*SerDe classes to suit. Substituting the 
char[].length for the UTF-8 byte[].length is only safe where ASCII predominates, 
as in English-speaking locales; even then, there are plenty of situations 
where this assumption will not hold.  
   
   As to your suggestions in the PR, I will have to study them more closely.  
Some I think are OK, others I will have to think about.
   
   ****
   Back to your description of your use-case.  Allow me to try to play back to 
you what I am gathering from your description:
   
   - Start with a very large table with ~1T (1e12) rows, each row can have 
several hundred columns.
   - "With a string-size of 64 bytes and 1T rows, you're looking at 64T of 
data. Across 100M groups that's roughly 640k of data per group and 100K items 
per sketch in the group."
       - Doesn't this assume you are analyzing only one column?   
   - Nonetheless, this data per group will vary widely and likely be power-law 
distributed. (a few groups with millions of rows, and millions of groups with 
only a few rows).
   
   It is my understanding that it is at this group level where you want to 
analyze the distribution of items, where you might have a KLL sketch per 
column, thus many sketches just for this group. And, of course, all 100M 
groups would be configured with sketches as well, implying a huge number of 
sketches for the entire query analysis; thus the concern about memory usage.
   
   Assuming that the above is roughly right, here is where I am a little 
confused (I am not a DB-engine expert by any means!). Above, you mentioned:
   
   > The main objective of using these sketches is to estimate the result of 
predicates (`=`, `!=`, `<`, `<=`, `>`, `>=`, etc.) applied to columns on a 
table, which enables the optimizer's CBO to create better query plans.
   
   - Can you give me an example of how you use the distribution information 
from these sketches along with the user chosen predicates to help you with 
query optimization? 
   - What are you looking for in the distributions from these sketches?
   - What decisions can you (or your query-planning model) make from these 
distributions?
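   For context on how I imagine a planner could use a rank estimate for selectivity (a hedged sketch of the idea only: an exact sorted sample stands in for a quantiles sketch's rank query, and the names here are illustrative, not the library API):

   ```java
   import java.util.Arrays;

   public class SelectivitySketch {
       // Rank of v in [0,1]: the fraction of items strictly less than v.
       // A quantiles sketch answers the same question approximately; here an
       // exact sorted array stands in, purely to show the planner arithmetic.
       static double rank(double[] sorted, double v) {
           int i = Arrays.binarySearch(sorted, v);
           int idx = (i >= 0) ? i : -(i + 1);
           return (double) idx / sorted.length;
       }

       public static void main(String[] args) {
           double[] col = new double[1000];
           for (int i = 0; i < col.length; i++) col[i] = i;  // uniform 0..999

           // Selectivity of "col < 250" is rank(250): ~25% of rows survive.
           double selLt = rank(col, 250.0);

           // Selectivity of "col >= 250" is the complement, 1 - rank(250).
           double selGe = 1.0 - selLt;

           // Range predicate "100 <= col < 250": rank(250) - rank(100).
           double selRange = rank(col, 250.0) - rank(col, 100.0);

           System.out.printf("%.2f %.2f %.2f%n", selLt, selGe, selRange);
       }
   }
   ```

   My guess is the CBO multiplies such per-column selectivities into row-count estimates, but that is exactly what I am asking you to confirm.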
   
   
   
   _Note: If you are looking for Heavy-Hitters the Frequent Items sketch might 
be very useful._
   
   _Note: If you are converting these sketches into PMF histograms, I should 
caution you about trying to use too many split points for the given K of the 
sketch, because you can easily exceed the error threshold of the sketch 
(producing garbage).  Our most recent release, 6.0.0, has some extra functions 
to help you with this._  
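   To show the arithmetic behind that caution (an illustrative sketch only: the epsilon value is an assumed placeholder, and in practice you would ask the sketch itself for its normalized rank error rather than hard-code one):

   ```java
   public class PmfSplitPoints {
       public static void main(String[] args) {
           // Assumed normalized rank error for the sketch (placeholder value
           // for illustration; query the real sketch for its actual error).
           double eps = 0.0165;

           // With m split points, a PMF query returns m + 1 buckets. If mass
           // were spread evenly, each bucket would hold about 1/(m+1) of it.
           // Once that mass falls near or below eps, the per-bucket estimate
           // is dominated by the sketch's error, i.e. mostly noise.
           for (int m : new int[]{10, 30, 60, 120}) {
               double avgMass = 1.0 / (m + 1);
               System.out.printf("m=%3d  avg bucket mass=%.4f  %s%n",
                   m, avgMass,
                   avgMass > 2 * eps ? "ok" : "at/below error threshold");
           }
       }
   }
   ```

   The point is that the usable number of split points is bounded by the sketch's error, not by how fine a histogram you would like to draw.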
   
   *****
   Given your estimate of an average of 64B per item, this is what the KLL 
Growth Path might look like:
   
   <img width="475" alt="NormalizedGrowth_64B_K200" 
src="https://github.com/apache/datasketches-java/assets/12941506/1251b778-a877-4009-a30b-533424403e3e">
   
   This is obviously much bigger than 5KB!  It is just the previous graph, 
which was normalized to one byte per item, scaled by 64.  
   
   Cheers,
   Lee.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

