ZacBlanco commented on PR #554:
URL: 
https://github.com/apache/datasketches-java/pull/554#issuecomment-2085288136

   Hi Lee, I really appreciate your response. Below are some answers to your 
questions.
   
   > The first concern that jumps out at me is that your specific problem and 
the solution you propose is focused on items as Strings. Although I recognize 
that Strings may be a dominant use case for the items sketches, it is not the 
only application, as items sketches can be used for arbitrary objects. So I am 
hesitant to put into the sketch a lot of special code that is only useful for 
Strings.
   
   I realize my initial post mentions mostly strings because it's a harder 
problem to solve, but you can see from the performance benchmarks that 
`getSerializedSizeBytes` is also more than an order of magnitude slower for 
`KllItemsSketch<Double>` than the primitive version, `KllDoublesSketch`. My 
initial implementation using KLLs in our database used the Items sketch because 
of the API convenience of generics, however I didn't expect such a large 
performance discrepancy. The performance issue was initially identified due to 
slow aggregations which used `KllItemsSketch<Double>` and tracked their size 
through `getSerializedSizeBytes`.
   
   Replacing the `KllItemsSketch<Double>` implementation for just double types 
with `KllDoublesSketch` is an option for us (and one I am currently working 
on), but I think there are optimizations to be made for the items sketch.
   
   I would also like to clarify the use case here. I'm coming at this from an engine author's perspective. I currently work on SQL query optimizers, and we're using KLL sketches as the backing data structure for histogram column statistics on large tables. The main objective of using these sketches is to estimate the selectivity of predicates (`=`, `!=`, `<`, `<=`, `>`, `>=`, etc.) applied to rows so that the cost-based optimizer (CBO) can create better query plans.
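   To make the connection concrete: the selectivity of a predicate like `x <= c` is just the normalized rank of `c`, which is exactly the quantity a KLL rank query approximates. Below is a minimal illustration using an exact sorted sample as a stand-in for the sketch; the class and method names are mine, not from the library:

```java
import java.util.Arrays;

public class SelectivityDemo {
    /** Fraction of values <= c, i.e. the normalized rank of c. */
    static double selectivityLessEqual(double[] sorted, double c) {
        int idx = Arrays.binarySearch(sorted, c);
        if (idx < 0) {
            idx = -idx - 1;  // insertion point = count of values strictly < c
        } else {
            // step past duplicates so every value equal to c is counted
            while (idx + 1 < sorted.length && sorted[idx + 1] == c) idx++;
            idx++;
        }
        return (double) idx / sorted.length;
    }

    public static void main(String[] args) {
        double[] col = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        // Estimated selectivity of "x <= 3": the CBO would expect ~30% of rows.
        System.out.println(selectivityLessEqual(col, 3.0)); // 0.3
    }
}
```

   In the engine, the sorted array is of course replaced by a rank query against the column's KLL sketch, which returns the same quantity approximately.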
   
   >  ... how "rough" is this need? +/- 1%, 10%, 50% ? Because I think we can 
come up with a good approximation of the memory usage that may be much simpler 
than wedging a bunch of code for a specific use case into the sketch itself.
   
   The penalty for wrong estimates is queries that fail to execute after excessive untracked memory usage hits heap limits. I'm not a database user, so I can't speak exactly to the variety of workloads, and data distributions also depend on customers/users. I would argue an absolute error of <= 5% is acceptable; > 10% would be unacceptable. Overestimates are better than underestimates, since an overestimate would simply shift scheduling around in the cluster. That could cause query starvation under high load, but I think it would be preferable behavior compared to failing queries.
   
   > Do these extrapolations sound about right? If you are not sure, would it 
be possible for you to characterize your string length, row size, and group 
size distributions of your raw data? 
   
   As I mentioned, I am the engine author, not a consumer, so I can't speak to the data distribution. In my original comment, I meant 1T as _1T rows_. The table itself could be 3 columns or 300. For the sake of this discussion, let's say the median string size is 64 bytes for a varchar-typed column, and we're focusing on the computation of a single column.
   
   With a string size of 64 bytes and 1T rows, you're looking at 64 TB of data. Across 100M groups that's roughly 640 KB of data per group and 10K items per sketch in the group.
   
   If you assume sketches are ~5 KiB each, then with 100M groups you're looking at roughly 480 GiB of memory to store all of the sketches. You could drop the group count by an order of magnitude to 10M and still have ~48 GiB of sketches accumulating in memory. Depending on cluster node size, that is a good chunk of memory that needs to be accounted for in the JVM, or else we could OOM.
   
   You could probably even scale my original scenario down to 1B rows and 10M groups, and the same math above would still apply.
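   For reference, the arithmetic above, with every input being an assumed figure from this thread (1T rows, 64-byte median strings, 100M groups, ~5 KiB per serialized sketch), works out as follows:

```java
public class MemoryMath {
    /** Raw data bytes landing in each group. */
    static long bytesPerGroup(long rows, long bytesPerItem, long groups) {
        return rows / groups * bytesPerItem;  // divide first to keep intermediates small
    }

    /** Items fed to each group's sketch. */
    static long itemsPerSketch(long rows, long groups) {
        return rows / groups;
    }

    /** Total memory for all sketches, in GiB. */
    static double totalSketchGiB(long sketchBytes, long groups) {
        return (double) sketchBytes * groups / (1L << 30);
    }

    public static void main(String[] args) {
        long rows = 1_000_000_000_000L;   // 1T rows
        long groups = 100_000_000L;       // 100M groups
        System.out.println(bytesPerGroup(rows, 64, groups));  // 640000 (~640 KB)
        System.out.println(itemsPerSketch(rows, groups));     // 10000
        System.out.printf("%.0f GiB%n", totalSketchGiB(5 * 1024, groups)); // 477 GiB
    }
}
```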
   
   > What value of K are you using for your sketches now? And what is your 
requirement for rank accuracy that led you to choose this value of K?
   
   Currently, the engine defaults to the library default of k = 200. This could 
be configured by the DB user though.
   
   > I assume, although you didn't say exactly, that having an estimate of how 
much memory your aggregation functions are using is important for memory 
planning and partitioning, and more important than the fact that they are using 
memory. Is that correct?
   
   Having an idea of the memory utilization is important to prevent query 
failures under system load and to ensure good scheduling decisions are made and 
that queries are not starved of resources when they may actually be available.
   
   ---
   
   I do like your idea of using a separate sketch to track the distribution of string sizes. However, it's not 100% clear to me how to deduce the memory utilization of the sketch from the distribution of sizes. I presume you would need some information relating whether an item is retained to that item's size, and that relationship isn't obvious to me, since which items are retained usually depends on insertion order and frequency. Either way, more elaboration on that idea would be great.
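   For what it's worth, here is my tentative reading of the size-sketch idea, under the (possibly wrong) assumption that the retained items are a representative sample of the input, so their lengths follow the same distribution the size sketch tracks. The names and the overhead constant below are hypothetical, not from the library:

```java
public class SizeSketchIdea {
    // Hypothetical per-item JVM overhead (object header, references); an
    // assumption for illustration, not a measured value.
    static final long PER_ITEM_OVERHEAD_BYTES = 48;

    /**
     * Estimated heap bytes for an items sketch's retained items: the mean
     * item size would come from the companion size sketch, the retained
     * count from the items sketch itself.
     */
    static long estimateRetainedBytes(long numRetained, double meanItemBytes) {
        return (long) (numRetained * (meanItemBytes + PER_ITEM_OVERHEAD_BYTES));
    }

    public static void main(String[] args) {
        // e.g. 600 retained strings with a mean size of 64 bytes
        System.out.println(estimateRetainedBytes(600, 64.0)); // 67200
    }
}
```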
   
   We are actually currently using a set of linear regression parameters which 
we computed to estimate the true in-memory size. It is calculated with a 
formula like `true_size = linear_param * sketch.getSerializedSizeBytes() + 
constant_param`. We calculated the parameters using our own canned test data. The parameters for the String type assume a constant item length, but they could be far off depending on the user's data. An ideal solution would either build the size-based sketch on the fly or not rely on pre-baked parameters at all. DB users can have vastly different data distributions, which can make the parameters inaccurate across scenarios; that is why I attempted the solution in this PR.
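   For clarity, the correction we apply today looks roughly like the following; the parameter values here are placeholders for illustration, not our actual fit:

```java
public class SketchSizeEstimator {
    // Hypothetical regression parameters, fit offline against canned test
    // data (the real values differ per item type).
    static final double LINEAR_PARAM = 2.1;
    static final long CONSTANT_PARAM = 1_024;

    /** Estimates true in-memory size from the serialized size. */
    static long estimateHeapBytes(long serializedSizeBytes) {
        return (long) (LINEAR_PARAM * serializedSizeBytes) + CONSTANT_PARAM;
    }

    public static void main(String[] args) {
        // e.g. a sketch reporting 5 KiB serialized would be charged ~11.5 KiB
        System.out.println(estimateHeapBytes(5 * 1024)); // 11776
    }
}
```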


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

