ZacBlanco commented on PR #554: URL: https://github.com/apache/datasketches-java/pull/554#issuecomment-2085288136
Hi Lee, I really appreciate your response. Below are some answers to your questions.

> The first concern that jumps out at me is that your specific problem and the solution you propose is focused on items as Strings. Although I recognize that Strings may be a dominant use case for the items sketches, it is not the only application, as items sketches can be used for arbitrary objects. So I am hesitant to put into the sketch a lot of special code that is only useful for Strings.

I realize my initial post mentions mostly strings because they are the harder problem to solve, but you can see from the performance benchmarks that `getSerializedSizeBytes` is also more than an order of magnitude slower for `KllItemsSketch<Double>` than for the primitive version, `KllDoublesSketch`. My initial implementation using KLLs in our database used the items sketch because of the API convenience of generics; I didn't expect such a large performance discrepancy. The issue was initially identified through slow aggregations that used `KllItemsSketch<Double>` and tracked their size with `getSerializedSizeBytes`. Replacing the `KllItemsSketch<Double>` implementation with `KllDoublesSketch` for double types is an option for us (and one I am currently working on), but I think there are still optimizations to be made for the items sketch.

I would also like to clarify the use case here, as I'm coming from an engine-author perspective. I currently work on SQL query optimizers, and we're using KLL sketches as the backing data structure for histogram column statistics on large tables. The main objective of using these sketches is to estimate the result of predicates (`=`, `!=`, `<`, `<=`, `>`, `>=`, etc.) applied to rows so that the optimizer's CBO can create better query plans.

> ... how "rough" is this need? +/- 1%, 10%, 50%? Because I think we can come up with a good approximation of the memory usage that may be much simpler than wedging a bunch of code for a specific use case into the sketch itself.

The penalty for wrong estimates is queries that fail to execute due to excessive untracked memory usage hitting heap limits. I'm not a database user, so I can't speak exactly to the variety of workloads; data distributions also depend on customers/users. I would argue an absolute error of <= 5% is acceptable, while >10% would be unacceptable. Overestimates are better than underestimates, since an overestimate would simply shift scheduling around in the cluster. That could cause query starvation under high load, but I think that would be preferable to failing queries.

> Do these extrapolations sound about right? If you are not sure, would it be possible for you to characterize your string length, row size, and group size distributions of your raw data?

As I mentioned, I am the engine author, not a consumer, so I can't speak to the data distribution. In my original comment, I meant 1T as _1T rows_. The table itself could have 3 columns or 300. For the sake of this discussion, let's say the median string size is 64 bytes for a varchar-typed column, and we're focusing on the computation of a single column. With a string size of 64 bytes and 1T rows, you're looking at 64 TB of data. Across 100M groups, that's roughly 640 KB of data per group and 10K items per sketch. If you assume sketches are ~5 KiB each, with 100M groups you're looking at roughly 500 GB of memory to store all of the sketches. You could drop the group count by an order of magnitude to 10M and still have ~50 GB of sketches accumulating in memory. Depending on cluster node size, this can be a good chunk of memory that needs to be accounted for in the JVM, or else we could OOM.
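To make the back-of-envelope math above concrete, here it is written out explicitly. This is only a sketch of the arithmetic in the scenario; the ~5 KiB per-sketch footprint is an assumption for illustration, not a measured value:

```java
public class SketchMemoryEstimate {
    public static void main(String[] args) {
        long rows = 1_000_000_000_000L;     // 1T rows in the table
        long medianStringBytes = 64;        // assumed median varchar size
        long groups = 100_000_000L;         // 100M aggregation groups
        long sketchBytes = 5 * 1024L;       // assumed ~5 KiB per KLL sketch

        long totalDataBytes = rows * medianStringBytes;  // raw column data
        long bytesPerGroup = totalDataBytes / groups;    // data per group
        long itemsPerSketch = rows / groups;             // items fed to each sketch
        long totalSketchBytes = groups * sketchBytes;    // memory for all sketches

        System.out.printf("raw column data: %d TB%n", totalDataBytes / 1_000_000_000_000L);
        System.out.printf("per group: %d KB, %d items%n", bytesPerGroup / 1_000, itemsPerSketch);
        System.out.printf("all sketches: ~%d GB%n", totalSketchBytes / 1_000_000_000L);
    }
}
```

Running this prints 64 TB of raw data, 640 KB and 10,000 items per group, and ~512 GB for the sketches alone, which is where the memory-accounting pressure comes from.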
You could probably even scale my original scenario down to 1B rows and 10M groups and the same math would still apply.

> What value of K are you using for your sketches now? And what is your requirement for rank accuracy that led you to choose this value of K?

Currently, the engine defaults to the library default of k = 200, though this can be configured by the DB user.

> I assume, although you didn't say exactly, that having an estimate of how much memory your aggregation functions are using is important for memory planning and partitioning, and more important than the fact that they are using memory. Is that correct?

Having an idea of the memory utilization is important to prevent query failures under system load, to ensure good scheduling decisions are made, and to avoid starving queries of resources that may actually be available.

---

I do like your idea of using a separate sketch to track the distribution of string sizes. However, it's not 100% clear to me how to deduce the memory utilization of the sketch from the distribution of sizes. I presume you would need some information on how an item's retention relates to its size, which isn't clear to me because the retained items usually depend on insertion frequency. Either way, more elaboration on that idea would be great.

We are currently using a set of linear regression parameters, computed offline, to estimate the true in-memory size. It is calculated with a formula like `true_size = linear_param * sketch.getSerializedSizeBytes() + constant_param`. We calculated the parameters using our own canned test data. The parameters for the String case assume a constant item length, so they could be far off depending on the user's data. An ideal solution would either build the size-tracking sketch on the fly or not rely on pre-baked parameters at all.
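For reference, the regression workaround looks roughly like the following. The coefficient values here are placeholders (ours were fit offline against canned test data and differ per item type), and `serializedSizeBytes` stands in for the value a real `sketch.getSerializedSizeBytes()` call would return:

```java
public class SketchSizeRegression {
    // Placeholder coefficients; real values are fit offline per item type.
    static final double LINEAR_PARAM = 2.5;
    static final double CONSTANT_PARAM = 1024.0;

    /** Estimate the true in-memory size from the serialized size. */
    static long estimateInMemoryBytes(long serializedSizeBytes) {
        return (long) (LINEAR_PARAM * serializedSizeBytes + CONSTANT_PARAM);
    }

    public static void main(String[] args) {
        // In the engine, this value comes from sketch.getSerializedSizeBytes().
        long serialized = 4_096;
        System.out.println(estimateInMemoryBytes(serialized)); // 2.5*4096 + 1024 = 11264
    }
}
```

The weakness is exactly what's described above: the fit bakes in one item-size distribution, so a user whose strings are much longer or shorter than the test data will get a skewed estimate.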
DB users can have vastly different data distributions that would make those parameters inaccurate across scenarios, which is why I attempted the solution in this PR.
