guowangy opened a new pull request, #11722: URL: https://github.com/apache/incubator-gluten/pull/11722
## What changes are proposed in this pull request?

Introduces multi-segment-per-partition support in the Velox backend columnar shuffle writer. Partition data can now be flushed incrementally to the final data file during processing, which reduces peak memory usage without requiring full in-memory buffering or temporary spill files.

The implementation can reduce total TPC-H (SF6T) latency by ~16% when using sort-based shuffle with low memory capacity on a 2-socket Xeon 6960P system.

### New index file format (`ColumnarIndexShuffleBlockResolver`)

Extends `IndexShuffleBlockResolver` with a new index format supporting multiple `(offset, length)` segments per partition:

```
[Partition Index: (N+1) × 8-byte big-endian offsets]
[Segment Data: per-partition list of (data_offset, length) pairs, each 8 bytes]
[1-byte end marker]  ← distinguishes from legacy format (size always multiple of 8)
```

<img width="1013" height="381" alt="image" src="https://github.com/user-attachments/assets/74e278e1-3d8e-4225-95a6-7c9eeac133ce" />

`ColumnarShuffleManager` now uses this resolver. Multi-segment mode activates only when external shuffle service, push-based shuffle, and dictionary encoding are all disabled (dictionary encoding requires all batches to be complete before writing).
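To make the index layout concrete, here is a minimal Java sketch of writing and reading such a file in memory. It is not the code from this PR. The class and method names are hypothetical, the end-marker value `1` is a placeholder (the description only says the marker is one byte), and two points are assumptions: that the partition index offsets point into the segment-data section of the index file itself, and that each of `data_offset` and `length` occupies 8 bytes.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the multi-segment index layout; not the PR's implementation.
class MultiSegmentIndexSketch {
  /** One contiguous run of bytes in the shuffle data file. */
  static final class Segment {
    final long offset;
    final long length;

    Segment(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }
  }

  static final byte END_MARKER = 1;  // assumed value; only its 1-byte size matters here
  static final int ENTRY_BYTES = 16; // assumed: 8-byte data_offset + 8-byte length

  /** Serializes per-partition segment lists into the three-part layout. */
  static byte[] encode(List<List<Segment>> partitions) {
    int headerBytes = (partitions.size() + 1) * 8;
    int segmentCount = 0;
    for (List<Segment> p : partitions) {
      segmentCount += p.size();
    }
    // ByteBuffer is big-endian by default, matching the described format.
    ByteBuffer buf = ByteBuffer.allocate(headerBytes + segmentCount * ENTRY_BYTES + 1);

    // Partition index: N+1 offsets delimiting each partition's slice of segment data.
    long cursor = headerBytes;
    buf.putLong(cursor);
    for (List<Segment> p : partitions) {
      cursor += (long) p.size() * ENTRY_BYTES;
      buf.putLong(cursor);
    }
    // Segment data: (data_offset, length) pairs, grouped by partition.
    for (List<Segment> p : partitions) {
      for (Segment s : p) {
        buf.putLong(s.offset);
        buf.putLong(s.length);
      }
    }
    buf.put(END_MARKER);
    return buf.array();
  }

  /** A legacy index is always a multiple of 8 bytes; the trailing marker breaks that. */
  static boolean isMultiSegmentFormat(long fileSize) {
    return fileSize % 8 != 0;
  }

  /** Returns the segments that make up one partition's data. */
  static List<Segment> segmentsOf(byte[] index, int partition) {
    ByteBuffer buf = ByteBuffer.wrap(index);
    int start = Math.toIntExact(buf.getLong(partition * 8));
    int end = Math.toIntExact(buf.getLong((partition + 1) * 8));
    List<Segment> result = new ArrayList<>();
    for (int pos = start; pos < end; pos += ENTRY_BYTES) {
      result.add(new Segment(buf.getLong(pos), buf.getLong(pos + 8)));
    }
    return result;
  }

  public static void main(String[] args) {
    List<List<Segment>> partitions = new ArrayList<>();
    // Partition 0 was flushed twice, so it owns two non-adjacent segments.
    partitions.add(List.of(new Segment(0, 100), new Segment(250, 50)));
    partitions.add(List.of());                      // empty partition
    partitions.add(List.of(new Segment(100, 150))); // flushed between the two above

    byte[] index = encode(partitions);
    System.out.println("multi-segment format: " + isMultiSegmentFormat(index.length));
    for (Segment s : segmentsOf(index, 0)) {
      System.out.println("partition 0 segment: offset=" + s.offset + " length=" + s.length);
    }
  }
}
```

Under these assumptions a reader needs two lookups per partition: one into the fixed-size partition index to find the slice of segment entries, and one pass over that slice to collect the `(data_offset, length)` pairs. The size check alone is enough to tell the two formats apart, so legacy index files can be read without a version field.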
### New I/O abstractions

- **`FileSegmentsInputStream`**: `InputStream` over non-contiguous `(offset, size)` file segments; supports zero-copy native reads via `read(destAddress, maxSize)`
- **`FileSegmentsManagedBuffer`**: `ManagedBuffer` backed by discontiguous segments; supports `nioByteBuffer()`, `createInputStream()`, `convertToNetty()`
- **`DiscontiguousFileRegion`**: Netty `FileRegion` mapping a logical range to multiple physical segments for zero-copy network transfer
- **`LowCopyFileSegmentsJniByteInputStream`**: zero-copy JNI wrapper over `FileSegmentsInputStream`; wired into `JniByteInputStreams.create()`

### C++ `LocalPartitionWriter` changes

- `usePartitionMultipleSegments_` flag + `partitionSegments_` vector tracking `(start, length)` per partition
- `flushCachedPayloads()`: incremental flush after each `hashEvict`
- `writeMemoryPayload()`: direct write to final data file during `sortEvict`
- `writeIndexFile()`: serializes the new index at stop time
- `PayloadCache::writeIncremental()`: flushes completed (non-active) partitions without touching the in-use partition

### JNI/JVM wiring

`LocalPartitionWriterJniWrapper` and `JniWrapper.cc` accept a new optional `indexFile` parameter; `ColumnarShuffleWriter` passes the temp index file path when multi-segment mode is active.

## How was this patch tested?

New unit test suites:

- `ColumnarIndexShuffleBlockResolverSuite`: index format read/write, format detection, multi-segment block lookup
- `FileSegmentsInputStreamSuite`: sequential reads, multi-segment traversal, skip, zero-copy native reads
- `FileSegmentsManagedBufferSuite`: `nioByteBuffer`, `createInputStream`, `convertToNetty`, EOF and mmap edge cases
- `DiscontiguousFileRegionSuite`: Netty transfer across discontiguous segments, lazy open
- `LowCopyFileSegmentsJniByteInputStreamTest`: JNI wrapper correctness for ByteInputStream

## Was this patch authored or co-authored using generative AI tooling?
