ZacBlanco commented on PR #554: URL: https://github.com/apache/datasketches-java/pull/554#issuecomment-2082973601
Hi Jon, thanks for your quick response. I think this change adds little overhead except for the `ItemsSketch`, which uses non-fixed-width types, specifically Strings. I'm willing to iterate on this to find a solution the community would be willing to accept; I don't expect the PR to be merged in its current state. First, I'll try to paint a picture of why I think this change is needed. Then I'll provide a few microbenchmark comparisons between the master branch and this PR to show its performance impact.

----

### Background

The system we want to use KLL sketches in is a large distributed database engine. Tables can easily hold 100TB+ of data. We're implementing a KLL sketch aggregation function for a variety of data types. Fixed-width types work great, but we also want to support strings.

Imagine a table with 1T rows, and you run a `GROUP BY` query on that table with a cardinality somewhere in the range of 100M. The execution engine splits the work into tasks and schedules those tasks in a way that satisfies the memory constraints of the cluster. 100M KLL sketches can, depending on the parameters, amount to a not-insignificant chunk of memory, and the scheduler may need to move execution to another node based on that information. Hence, the system needs to track roughly how much memory an aggregation is using. The design of the engine is such that querying the current estimated size must be very fast, which means we need to incrementally track the size of KLL sketches as items are added. In the JVM, exact memory tracking is very expensive, but even a good ballpark number is useful for scheduling these tasks.
In the current iteration of the sketch function implementation, we tried something like this:

```java
memoryUsage -= sketch.getSerializedSizeBytes();
sketch.update(item);
memoryUsage += sketch.getSerializedSizeBytes();
```

Unfortunately, `getSerializedSizeBytes()` is so expensive that it slows queries to a crawl for all types in the `ItemsSketch`; it's over ~100X slower than simply calling `sketch.update`. One of the main culprits is that for `ItemsSketch` we end up doing a potentially large array copy on every call. The call chain is:

https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllSketch.java#L320

which calls:

https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllHeapItemsSketch.java#L235

which calls:

https://github.com/apache/datasketches-java/blob/0bff44b13ccf2eb27d2519942016d9d83327af5f/src/main/java/org/apache/datasketches/kll/KllHeapItemsSketch.java#L235

`getRetainedItemsByteArr` performs a very large and unnecessary allocation, which makes getting the current size far too expensive. I hope the above explanation clears up the motivation behind this PR. Please feel free to ask any questions.

### Performance

I wrote some basic microbenchmarks with JMH to test the performance of these methods. The patch with benchmarks can be found here: https://github.com/ZacBlanco/datasketches-java/commit/102e383d72edd044e161c96e3031e28a303b14f5. The benchmarks cover doubles, floats, and items (strings and doubles) for the methods `update`, `getSerializedSizeBytes`, and `getTotalItemsNumBytes`. The results are below.

You can see that, aside from `update` for `KllItemsSketch<String>`, the performance of virtually every other method is unchanged.
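To make the cost asymmetry concrete outside of JMH, here is a minimal standalone sketch (class and method names are mine, not the library's) contrasting a size query that allocates and copies, roughly what the `getRetainedItemsByteArr` path does on every call, with one that just reads an incrementally maintained counter:

```java
import java.util.Arrays;

public final class SizeQueryCost {
  // Simulates the expensive path: allocate and copy the retained items
  // just to learn their serialized length, which is roughly what
  // getSerializedSizeBytes ends up doing for the ItemsSketch.
  static int sizeViaCopy(final byte[] retained) {
    final byte[] copy = Arrays.copyOf(retained, retained.length);
    return copy.length;
  }

  // Simulates the cheap path: return an incrementally maintained counter,
  // which is what a getTotalItemsNumBytes-style accessor can do.
  static int sizeViaCounter(final int cachedNumBytes) {
    return cachedNumBytes;
  }

  public static void main(String[] args) {
    final byte[] retained = new byte[1 << 20]; // pretend 1 MiB of retained items
    long sink = 0;

    long t0 = System.nanoTime();
    for (int i = 0; i < 1000; i++) sink += sizeViaCopy(retained);
    final long copyNs = System.nanoTime() - t0;

    t0 = System.nanoTime();
    for (int i = 0; i < 1000; i++) sink += sizeViaCounter(retained.length);
    final long counterNs = System.nanoTime() - t0;

    System.out.println("copy path:    " + (copyNs / 1000) + " ns/op");
    System.out.println("counter path: " + (counterNs / 1000) + " ns/op");
    if (sink == 42) System.out.println(); // keep the JIT from eliding the loops
  }
}
```

This is only an illustration of the shape of the problem; the real numbers above come from the JMH patch linked earlier.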
Additionally, the cost of `getTotalItemsNumBytes` is multiple orders of magnitude lower than that of `getSerializedSizeBytes()` (doubles: ~615ns vs ~10ns; strings: ~200k ns vs ~9ns), which solves my initial problem of needing a quick way to get the rough memory usage of the `ItemsSketch`.

### Additional Optimizations

I think there are two related but slightly different optimizations that can be made.

First, in `ArrayOfStringsSerDe`, the `sizeOf` method calculates the size of the byte array using `String#getBytes`, which copies the existing String into a new byte array and then queries only the length of that array. This is quite slow and results in many more allocations than necessary. I would highly recommend the library adopt a method that calculates the encoded length of a string without these additional allocations. There is an implementation in the Guava library which could probably be copied directly into the datasketches library without pulling in any dependencies: https://github.com/google/guava/blob/9596c996098272f0ea69cfdf45c7c1fe1ef8406d/guava/src/com/google/common/base/Utf8.java#L49. While I think there is no downside to this improvement, its performance impact is not large enough to make `getSerializedSizeBytes` fast enough for my use case.

Second, I tested tracking the size for `getTotalItemsNumBytes` using `String#length` rather than performing a UTF-8 encoding inside `ArrayOfStringsSerDe#sizeOf`. With that change, the update performance for the Strings sketch was virtually no different than without the size tracking. Unfortunately, it means the tracked size wouldn't account for the extra bytes of multi-byte UTF-8 strings. Do you think there would be a good way to introduce the `String#length` call to perform the incremental size tracking?

Thanks for taking the time to read. Please let me know your thoughts.
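For reference, an allocation-free UTF-8 length computation in the spirit of Guava's `Utf8.encodedLength` might look like the following. This is my own simplified sketch, not the library's or Guava's code; unlike Guava, it counts an unpaired surrogate as 3 bytes instead of throwing (note `String#getBytes` would replace it with a 1-byte `?`):

```java
public final class Utf8Length {
  // Computes the UTF-8 encoded length of s without allocating a byte[],
  // unlike s.getBytes(StandardCharsets.UTF_8).length.
  public static int encodedLength(final String s) {
    int count = 0;
    for (int i = 0; i < s.length(); i++) {
      final char c = s.charAt(i);
      if (c < 0x80) {
        count += 1; // ASCII: 1 byte
      } else if (c < 0x800) {
        count += 2; // 2-byte sequence
      } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        count += 4; // valid surrogate pair: one 4-byte code point
        i++;
      } else {
        count += 3; // other BMP char (or unpaired surrogate, simplified here)
      }
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(encodedLength("hello"));  // 5
    System.out.println(encodedLength("héllo"));  // 6 (é is 2 bytes)
    System.out.println(encodedLength("日本語"));  // 9 (3 bytes each)
  }
}
```

A hybrid is also conceivable: use `String#length` as a lower bound for the fast incremental counter and fall back to a pass like this only when an exact byte count is required.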
