leerho commented on PR #554:
URL: 
https://github.com/apache/datasketches-java/pull/554#issuecomment-2089341949

   Thanks Zac, this is super helpful.  I was not aware that you offer sketches 
for use by your DB users.  That is wonderful!  Now I realize you have two use 
cases: one for User Queries (UQ), and one for Query Planning and Optimization 
(QPO) used by the DB developers.  
   
   Is it fair to assume that the QPO task is always performed prior to any UQ? 
In other words, you have the opportunity to learn a lot about the user's tables 
prior to the user using them.
   
   If this is true, then the QPO pass could give you, in advance of any UQ, 
the **_distribution of element sizes_** of any column the user might select.  
(I would hope that this kind of table metadata could be constructed as a table 
is being built, but that is a different discussion.) 
   
   As I have shown you in several graphs, the average size of a KLL sketch is 
quite predictable from _**k**_, the configured sketch size; **_n_**, the total 
number of elements fed to the sketch; and **_s_**, the size of the elements 
fed to the sketch.  If _**s**_ is a random variable, then all we need is its 
CDF.  Most importantly, we can establish upper bounds on both _**n**_ and 
_**s**_.  The most conservative approach would be to set _**n**_ to be the 
number of rows of the entire table and _**s**_ to be the size of the largest 
element in the column of interest.  From that we can predict pretty accurately 
(and very quickly) what the size of the sketch will be.  Or, if we discover 
from experience that this size is too conservative, we could use the floats 
sketch that computes the CDF of element size and choose the 90th percentile 
size instead of the max size, etc. 
   
   In other words, instead of trying to track the size of the sketch before and 
after each update, which is hugely expensive, can we compute a memory budget 
for the sketch to grow in and use that for memory planning? 
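
   To make the idea concrete, here is a rough sketch of such a budget 
calculation.  It is not library code: the class and method names are 
hypothetical, and the retained-item bound uses an assumed simple model (level 
capacities shrinking by 2/3 per level with a minimum width of 8, and at most 
1 + floor(log2(n)) levels).  The real numbers should come from the library or 
from the measured graphs.

```java
import java.util.Arrays;

// Hypothetical helper, not part of DataSketches: estimates a memory budget
// for a KLL sketch from k, an upper bound on n, and a chosen element size s.
class KllMemoryBudget {
    // Assumed minimum level width; treat as a modeling assumption.
    static final int MIN_LEVEL_WIDTH = 8;

    // Rough upper bound on the number of items a sketch retains.
    // Assumes level capacities of k, k*(2/3), k*(2/3)^2, ... floored at
    // MIN_LEVEL_WIDTH, with at most 1 + floor(log2(n)) levels.
    static long retainedItemsUpperBound(int k, long n) {
        if (n <= 0) {
            return 0;
        }
        int numLevels = 64 - Long.numberOfLeadingZeros(n); // 1 + floor(log2 n)
        long total = 0;
        double cap = k;
        for (int depth = 0; depth < numLevels; depth++) {
            total += Math.max(MIN_LEVEL_WIDTH, (long) Math.ceil(cap));
            cap *= 2.0 / 3.0;
        }
        return Math.min(total, n); // a sketch never retains more than n items
    }

    // Picks the element size at the given percentile (1..100) from observed
    // sizes. A stand-in for querying a sketch that tracks the CDF of sizes.
    static long sizeAtPercentile(long[] sizes, int pct) {
        if (sizes.length == 0 || pct < 1 || pct > 100) {
            throw new IllegalArgumentException("need sizes and pct in 1..100");
        }
        long[] sorted = sizes.clone();
        Arrays.sort(sorted);
        int idx = (pct * sorted.length + 99) / 100 - 1; // ceil(pct*len/100) - 1
        return sorted[idx];
    }

    // Budget in bytes for the item payload only; per-sketch overhead
    // (preamble, level offsets, object headers) is not modeled here.
    static long budgetBytes(int k, long n, long s) {
        return retainedItemsUpperBound(k, n) * s;
    }

    public static void main(String[] args) {
        long[] observedSizes = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        long rows = 1_000_000_000L; // upper bound on n: rows in the table
        long sMax = sizeAtPercentile(observedSizes, 100);
        long s90 = sizeAtPercentile(observedSizes, 90);
        System.out.println("conservative budget: " + budgetBytes(200, rows, sMax));
        System.out.println("90th pct budget:     " + budgetBytes(200, rows, s90));
    }
}
```

   The point is only that the budget is a cheap, one-time function of 
_**k**_, the bound on _**n**_, and the chosen _**s**_, so nothing needs to be 
measured on each update.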
   
   One more question: what Java version are you using?  We are in the 
planning stages to move to Java 17 and 21, but we will have to draw a hard 
line in the sand for Java 17.  In other words, at a specific DataSketches 
version (perhaps version 7 or 8), Java 17+ will be required.  If you still 
need Java versions < 17, you will have to use earlier versions of the 
library.  How will this impact you?
   
   Cheers,
   Lee.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

