nateab opened a new pull request, #27549: URL: https://github.com/apache/flink/pull/27549
## What is the purpose of the change `BinaryStringData.copy()` currently always allocates a new byte array and copies the underlying data, even when the source string is already "compact" - meaning it occupies a single memory segment at offset 0 with no wasted space. In this case, the string exclusively owns its backing memory and `copy()` would produce an identical, semantically equivalent object. This PR adds a fast path that returns `this` when the string is already compact, avoiding unnecessary allocation and memory copying. **Benchmark Results (JMH: 2 forks, 5 warmup, 5 measurement iterations):** | Benchmark | Baseline (ops/ms) | Optimized (ops/ms) | Improvement | |-----------|-------------------|---------------------|-------------| | copyCompactString | 18,381 ± 22,770 | 807,198 ± 13,825 | **+4,292%** | | copyLargeString | 3,551 ± 4,718 | 748,334 ± 6,211 | **+20,975%** | The baseline had high variance due to GC pressure from allocation. The optimized version eliminates allocation entirely for compact strings. **Why this is safe:** 1. `BinaryStringData` is immutable - it has no mutation methods 2. When all three conditions are true (`segments.length == 1`, `offset == 0`, `segment.size() == sizeInBytes`), the string was created via `fromString()` or a previous `copy()` call, meaning it owns its backing memory exclusively 3. Strings that are slices into larger buffers (network buffers, serialization buffers) will have `offset > 0` or `segment.size() > sizeInBytes`, so they still get copied ## Brief change log - Added fast-path check in `BinaryStringData.copy()` that returns `this` when the string is already compact (single segment, offset 0, no wasted space) ## Verifying this change This change is already covered by existing tests, such as: - `BinaryStringDataTest` (100 tests) - all existing tests pass, validating that copy semantics are preserved - `StringDataBenchmark` JMH benchmark confirms the performance improvement ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): no - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no - The serializers: yes - `BinaryStringData` is used by Table API serializers, but the change preserves identical semantics - The runtime per-record code paths (performance sensitive): yes - this is a performance optimization for the serialization hot path - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no - The S3 file system connector: no ## Documentation - Does this pull request introduce a new feature? no - If yes, how is the feature documented? not applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
