ZacBlanco commented on PR #554:
URL: 
https://github.com/apache/datasketches-java/pull/554#issuecomment-2082973601

   Hi Jon,
   
   Thanks for your quick response. I think this change presents little overhead 
except for the ItemsSketch which uses non-fixed fixed-width types - 
specifically on Strings. I’m willing to iterate on this to find a solution that 
the community would be willing to accept. I don’t expect the PR to be accepted 
in the current state.
   
   First, I’ll try to paint a picture of why I think this change is needed. 
Next, I’ll also provide a few of the microbenchmarks comparison between the 
master branch and this PR for a number of operations to show the impact of the 
PR on the performance.
   
   ----
   
   ### Background
   
   The system we are trying to use KLL sketches in is large distributed 
database engine. Tables can easily have 100TB+ of data. We’re implementing a 
KLL Sketch aggregation function for a variety different data types. Fixed-width 
types work great, but we’re also trying to implement them for strings.
   
   Imagine a table with 1T rows, then you run a `GROUP BY` query on that table. 
The cardinality of the GROUP BY is somewhere in the range of 100M.  The 
execution engine splits up the tasks and tries to schedule those tasks in a way 
that satisfies the memory constraints of the cluster. If you have 100M KLL 
sketches, depending on the parameters, it could amount to a not-insignificant 
chunk of memory and may need to make a task scheduling decision to move 
execution to another node based on that information. Hence, the system needs to 
track roughly how much memory an aggregation is using.
   
   The design of the database engine is such that querying the current 
estimated size needs to be very fast, which means we need to incrementally 
track the size of KLL sketches as items are added. In the JVM, exact memory 
tracking is very expensive, but even a good ballpark number is useful for us 
scheduling these tasks.
   
   In the current iteration of the sketch function implementation, we tried 
something like this:
   
   ```java
   memoryUsage -= sketch.getSerializedSizeBytes();
   sketch.update(item);
   memoryUsage += sketch.getSerializedSizeBytes();
   ```
   
   Unfortunately, `getSerializedSizeBytes()` is so expensive, it slows the 
queries down to a crawl for all types in the ItemsSketch. It’s over ~100X 
slower to do this than to simply call `sketch.update`. One of the main culprits 
which causes the slowdown is that for ItemsSketch, we end up doing a 
potentially large array copy for every call. The call chain is:
   
   
   
https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllSketch.java#L320
   
   which calls:
   
   
https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllHeapItemsSketch.java#L235
   
   which calls:
   
   
https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllHeapItemsSketch.java#L235
   
   
   `getRetainedItemsByteArr` is doing a very large and useless allocation which 
makes getting the current size far too expensive. 
   
   I hope the above explanation clears up the motivation behind this PR. Please 
feel free to ask any questions.
   
   ### Performance
   
   I wrote some basic microbenchmarks with JMH to test the performance of these 
methods. The patch with benchmarks can be found here: 
https://github.com/ZacBlanco/datasketches-java/commit/102e383d72edd044e161c96e3031e28a303b14f5.
 The benchmarks test the performance of doubles, floats, items (strings and 
doubles) for the methods: update, getSerializedSizeBytes, and 
getTotalItemsNumBytes. The results are below.
   
   
![image](https://github.com/apache/datasketches-java/assets/3464508/2b39b6aa-0b8a-443e-a5c1-5017776ac7a5)
   
![image](https://github.com/apache/datasketches-java/assets/3464508/c96d4441-2be1-47cd-ad62-48f5b8a7d5a9)
   
   You can see that aside from the update for `KllItemsSketch<String>`, the 
performance of virtually every other method is unchanged. Additionally, the 
cost of getTotalItemsNumBytes is multiple orders of magnitude lower than the 
cost of `getSerializedSizeBytes()`(Doubles: ~615 vs 10ns, Strings, ~200k ns vs 
9ns), which solves my initial problem of having a quick way to get the rough 
memory usage of the ItemsSketch.
   
   ### Additional Optimizations
   
   I think there are two related, but slightly different optimizations that can 
be made. 
   
   First, in the ArrayOfStringsSerDe, to calculate the size of the byte array, 
the `sizeOf` method utilizes `String#getBytes`, which copies the existing 
String into a new byte array, and then returns the new array where only the 
length of the array is queried. This is quite slow and results in many more 
allocations than necessary. I would highly recommend the library adopt a method 
which calculates the length of the strings without these additional allocations.
   
   There is an implementation in the Guava library which can probably be 
directly copied and dropped in to the datasketches library without having to 
pull in any dependencies. 
https://github.com/google/guava/blob/9596c996098272f0ea69cfdf45c7c1fe1ef8406d/guava/src/com/google/common/base/Utf8.java#L49
   
   While I think there is no downside with this improvement, it doesn’t have a 
large enough performance impact to make `getSerializedSizeBytes` fast enough 
for me to use.
   
   Second, I tested with tracking the size of `getTotalNumBytes` simply using 
`String#length` rather than doing a UTF-8 encoding inside of 
ArrayOfStringSerde#sizeOf. With that change, the update performance for the 
Strings sketch was virtually no different than without the size tracking. 
Unfortunately though, it means that we wouldn’t be able to support UTF-8 
strings. Do you think there would be a good way to introduce using the 
String#length call to perform the incremental size tracking?
   
   Thanks for taking the time to read. Please let me know your thoughts.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to