leerho commented on PR #554: URL: https://github.com/apache/datasketches-java/pull/554#issuecomment-2089341949
Thanks Zac, this is super helpful. I was not aware that you offer sketches for use by your DB users. That is wonderful! Now I realize you have two use cases: one for User Queries (UQ), and one for Query Planning and Optimization (QPO) used by the DB developers.

Is it fair to assume that the QPO task is always performed prior to any UQ? In other words, you have the opportunity to learn a lot about the user's tables before the user queries them. If this is true, then from the QPO pass you could have, in advance of any UQ, the **_distribution of element sizes_** of any column the user might select. (I would hope that this kind of table metadata could be constructed as a table is being built, but that is a different discussion.)

As I have shown you in several graphs, the average size of a KLL sketch is quite predictable from _**k**_, the configured sketch size, **_n_**, the total number of elements fed to the sketch, and **_s_**, the size of the elements fed to the sketch. If _**s**_ is a random variable, then all we need is its CDF. Most importantly, we can establish upper bounds on both _**n**_ and _**s**_. The most conservative approach would be to set _**n**_ to the number of rows of the entire table and _**s**_ to the size of the largest element in the column of interest. From that we can predict pretty accurately (and very quickly) what the size of the sketch will be. Or, if we discover from experience that this size is too conservative, we could use the floats sketch that computes the CDF of element size and choose the 90th percentile size instead of the max size, etc.

In other words, instead of trying to track the size of the sketch before and after each update, which is hugely expensive, can we compute a memory budget for the sketch to grow in and use that for memory planning?

One more question I wanted to ask is which Java version you are using. We are in the planning stages to move to Java 17 and 21, but we will have to draw a hard line in the sand for Java 17.
In other words, at a specific DataSketches version (perhaps version 7 or 8), Java 17+ will be required. If you still need Java versions < 17, you will have to use earlier versions of the library. How will this impact you?

Cheers,
Lee.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
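
As a rough illustration of the memory-budget idea in the comment above, the arithmetic might look like the following sketch. Everything named here is hypothetical and not part of the DataSketches API: `SketchMemoryBudget`, `sizePercentile`, `budgetBytes`, and the sample numbers are invented for illustration. It assumes the number of retained items for a given _**k**_ and _**n**_ is obtained elsewhere (for example from the library or from the empirical size curves), and it uses an exact sort as a stand-in for the floats sketch that would supply the CDF of element size.

```java
// Illustrative only: none of these names come from the DataSketches library.
final class SketchMemoryBudget {

    /** Nearest-rank percentile of observed element sizes in bytes; p must be in (0, 1]. */
    static long sizePercentile(long[] elementSizes, double p) {
        if (elementSizes.length == 0 || p <= 0.0 || p > 1.0) {
            throw new IllegalArgumentException("need non-empty sizes and 0 < p <= 1");
        }
        long[] sorted = elementSizes.clone();
        java.util.Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length); // 1-based nearest rank
        return sorted[rank - 1];
    }

    /**
     * Budget = fixed overhead + (retained items) * (planning size per item).
     * maxRetainedItems is an input here (assumption): it would come from the
     * library or from measured size curves for the chosen k and n.
     */
    static long budgetBytes(long maxRetainedItems, long bytesPerItem, long fixedOverheadBytes) {
        return Math.addExact(fixedOverheadBytes,
                Math.multiplyExact(maxRetainedItems, bytesPerItem));
    }

    public static void main(String[] args) {
        long[] sizes = {8, 12, 16, 16, 20, 24, 32, 40, 64, 512}; // made-up element sizes
        long retained = 600; // hypothetical bound on retained items for some k and n
        long overhead = 64;  // hypothetical fixed bookkeeping bytes

        // Most conservative: plan for every retained item being the largest element.
        long conservative = budgetBytes(retained, sizePercentile(sizes, 1.0), overhead);
        // Relaxed: plan with the 90th percentile element size instead of the max.
        long relaxed = budgetBytes(retained, sizePercentile(sizes, 0.9), overhead);

        System.out.println("max-size budget: " + conservative + " bytes");
        System.out.println("p90-size budget: " + relaxed + " bytes");
    }
}
```

The budget is computed once, before any updates, so nothing needs to be measured per update; the trade-off is that a percentile-based budget can be exceeded when unusually large elements are retained.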
