ZacBlanco commented on PR #554: URL: https://github.com/apache/datasketches-java/pull/554#issuecomment-2082973601
Hi Jon, thanks for your quick response. I think this change adds little overhead except for the `ItemsSketch`, which uses non-fixed-width types, specifically Strings. I'm willing to iterate on this to find a solution the community would be willing to accept; I don't expect the PR to be merged in its current state. First, I'll try to paint a picture of why I think this change is needed. Then I'll provide a few microbenchmark comparisons between the master branch and this PR to show its performance impact.

----

### Background

The system we want to use KLL sketches in is a large distributed database engine. Tables can easily hold 100TB+ of data. We're implementing a KLL sketch aggregation function for a variety of data types. Fixed-width types work great, but we also want to support strings.

Imagine a table with 1T rows, and you run a `GROUP BY` query on that table with a cardinality somewhere in the range of 100M. The execution engine splits the work into tasks and schedules those tasks in a way that satisfies the memory constraints of the cluster. 100M KLL sketches can, depending on the parameters, amount to a not-insignificant chunk of memory, and the scheduler may need to move execution to another node based on that information. Hence, the system needs to track roughly how much memory an aggregation is using. The design of the engine is such that querying the current estimated size must be very fast, which means we need to incrementally track the size of KLL sketches as items are added. In the JVM, exact memory tracking is very expensive, but even a good ballpark number is useful for scheduling these tasks.
In the current iteration of the sketch function implementation, we tried something like this:

```java
memoryUsage -= sketch.getSerializedSizeBytes();
sketch.update(item);
memoryUsage += sketch.getSerializedSizeBytes();
```

Unfortunately, `getSerializedSizeBytes()` is so expensive that it slows queries to a crawl for all types in the `ItemsSketch`; it's over ~100X slower than simply calling `sketch.update`. One of the main culprits is that for `ItemsSketch` we end up doing a potentially large array copy on every call. The call chain is:

https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllSketch.java#L320

which calls:

https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllHeapItemsSketch.java#L235

which calls:

https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllHeapItemsSketch.java#L235

`getRetainedItemsByteArr` performs a very large and unnecessary allocation, which makes getting the current size far too expensive. I hope the above explanation clears up the motivation behind this PR. Please feel free to ask any questions.

### Performance

I wrote some basic microbenchmarks with JMH to test the performance of these methods. The patch with benchmarks can be found here: https://github.com/ZacBlanco/datasketches-java/commit/102e383d72edd044e161c96e3031e28a303b14f5. The benchmarks cover doubles, floats, and items (strings and doubles) for the methods `update`, `getSerializedSizeBytes`, and `getTotalItemsNumBytes`. The results are below.

You can see that, aside from `update` for `KllItemsSketch<String>`, the performance of virtually every other method is unchanged.
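To make the cost asymmetry concrete outside of JMH, here is a minimal standalone sketch (class and method names are mine, not the library's) contrasting a size query that allocates and copies, roughly what the `getRetainedItemsByteArr` path does on every call, with one that just reads an incrementally maintained counter:

```java
import java.util.Arrays;

public final class SizeQueryCost {
  // Simulates the expensive path: allocate and copy the retained items
  // just to learn their serialized length, which is roughly what
  // getSerializedSizeBytes ends up doing for the ItemsSketch.
  static int sizeViaCopy(final byte[] retained) {
    final byte[] copy = Arrays.copyOf(retained, retained.length);
    return copy.length;
  }

  // Simulates the cheap path: return an incrementally maintained counter,
  // which is what a getTotalItemsNumBytes-style accessor can do.
  static int sizeViaCounter(final int cachedNumBytes) {
    return cachedNumBytes;
  }

  public static void main(String[] args) {
    final byte[] retained = new byte[1 << 20]; // pretend 1 MiB of retained items
    long sink = 0;

    long t0 = System.nanoTime();
    for (int i = 0; i < 1000; i++) sink += sizeViaCopy(retained);
    final long copyNs = System.nanoTime() - t0;

    t0 = System.nanoTime();
    for (int i = 0; i < 1000; i++) sink += sizeViaCounter(retained.length);
    final long counterNs = System.nanoTime() - t0;

    System.out.println("copy path:    " + (copyNs / 1000) + " ns/op");
    System.out.println("counter path: " + (counterNs / 1000) + " ns/op");
    if (sink == 42) System.out.println(); // keep the JIT from eliding the loops
  }
}
```

This is only an illustration of the shape of the problem; the real numbers above come from the JMH patch linked earlier.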
Additionally, the cost of `getTotalItemsNumBytes` is multiple orders of magnitude lower than that of `getSerializedSizeBytes()` (doubles: ~615ns vs ~10ns; strings: ~200k ns vs ~9ns), which solves my initial problem of needing a quick way to get the rough memory usage of the `ItemsSketch`.

### Additional Optimizations

I think there are two related but slightly different optimizations that can be made.

First, in `ArrayOfStringsSerDe`, the `sizeOf` method calculates the size of the byte array using `String#getBytes`, which copies the existing String into a new byte array and then queries only the length of that array. This is quite slow and results in many more allocations than necessary. I would highly recommend the library adopt a method that calculates the encoded length of a string without these additional allocations. There is an implementation in the Guava library which could probably be copied directly into the datasketches library without pulling in any dependencies: https://github.com/google/guava/blob/9596c996098272f0ea69cfdf45c7c1fe1ef8406d/guava/src/com/google/common/base/Utf8.java#L49. While I think there is no downside to this improvement, its performance impact is not large enough to make `getSerializedSizeBytes` fast enough for my use case.

Second, I tested tracking the size for `getTotalItemsNumBytes` using `String#length` rather than performing a UTF-8 encoding inside `ArrayOfStringsSerDe#sizeOf`. With that change, the update performance for the Strings sketch was virtually no different than without the size tracking. Unfortunately, it means the tracked size wouldn't account for the extra bytes of multi-byte UTF-8 strings. Do you think there would be a good way to introduce the `String#length` call to perform the incremental size tracking?

Thanks for taking the time to read. Please let me know your thoughts.
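For reference, an allocation-free UTF-8 length computation in the spirit of Guava's `Utf8.encodedLength` might look like the following. This is my own simplified sketch, not the library's or Guava's code; unlike Guava, it counts an unpaired surrogate as 3 bytes instead of throwing (note `String#getBytes` would replace it with a 1-byte `?`):

```java
public final class Utf8Length {
  // Computes the UTF-8 encoded length of s without allocating a byte[],
  // unlike s.getBytes(StandardCharsets.UTF_8).length.
  public static int encodedLength(final String s) {
    int count = 0;
    for (int i = 0; i < s.length(); i++) {
      final char c = s.charAt(i);
      if (c < 0x80) {
        count += 1; // ASCII: 1 byte
      } else if (c < 0x800) {
        count += 2; // 2-byte sequence
      } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        count += 4; // valid surrogate pair: one 4-byte code point
        i++;
      } else {
        count += 3; // other BMP char (or unpaired surrogate, simplified here)
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(encodedLength("hello"));  // 5
    System.out.println(encodedLength("héllo"));  // 6 (é is 2 bytes)
    System.out.println(encodedLength("日本語"));  // 9 (3 bytes each)
  }
}
```

A hybrid is also conceivable: use `String#length` as a lower bound for the fast incremental counter and fall back to a pass like this only when an exact byte count is required.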
