guowangy opened a new pull request, #11722:
URL: https://github.com/apache/incubator-gluten/pull/11722

   ## What changes are proposed in this pull request?
   
   Introduces multi-segment-per-partition support in the Velox backend columnar 
shuffle writer, enabling incremental flushing of partition data to the final 
data file during processing. This reduces peak memory usage without requiring 
full in-memory buffering or temporary spill files. The implementation can 
reduce total TPC-H (SF6T) latency by ~16% when using sort-based shuffle with 
low memory capacity on a 2-socket Xeon 6960P system.
   
   ### New index file format (`ColumnarIndexShuffleBlockResolver`)
   
   Extends `IndexShuffleBlockResolver` with a new index format supporting 
multiple `(offset, length)` segments per partition:
   
   ```
   [Partition Index: (N+1) × 8-byte big-endian offsets]
   [Segment Data: per-partition list of (data_offset, length) pairs, each 8 bytes]
   [1-byte end marker]  ← distinguishes from legacy format (size always multiple of 8)
   ```
   <img width="1013" height="381" alt="image" 
src="https://github.com/user-attachments/assets/74e278e1-3d8e-4225-95a6-7c9eeac133ce"
 />
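
To make the layout above concrete, here is a minimal Python sketch of encoding, detecting, and decoding such an index. It is an illustration under stated assumptions, not the PR's implementation: it assumes the `(N+1)` partition-index entries are byte offsets into the index file pointing at each partition's segment list, that `data_offset` and `length` are each an 8-byte big-endian value, and that the end marker byte is `0x01`. The function names are hypothetical.

```python
import struct

# Assumption: the PR only states that the marker is 1 byte; the value is made up.
END_MARKER = b"\x01"


def encode_index(partition_segments):
    """Serialize one list of (data_offset, length) pairs per partition."""
    n = len(partition_segments)
    header_size = (n + 1) * 8
    offsets = []
    body = bytearray()
    for segments in partition_segments:
        # Assumption: each index entry is the byte position of the
        # partition's first segment pair inside the index file.
        offsets.append(header_size + len(body))
        for data_offset, length in segments:
            body += struct.pack(">qq", data_offset, length)
    offsets.append(header_size + len(body))  # end of the last partition
    header = struct.pack(f">{n + 1}q", *offsets)
    return header + bytes(body) + END_MARKER


def is_multi_segment(index_size):
    """Legacy index sizes are multiples of 8; the trailing marker breaks that."""
    return index_size % 8 == 1


def decode_segments(index_bytes, partition_id):
    """Return the (data_offset, length) pairs recorded for one partition."""
    start, end = struct.unpack_from(">qq", index_bytes, partition_id * 8)
    return [struct.unpack_from(">qq", index_bytes, pos)
            for pos in range(start, end, 16)]
```

A partition with no data simply has equal start and end offsets, so it decodes to an empty list.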
   
   
   `ColumnarShuffleManager` now uses this resolver. Multi-segment mode 
activates only when external shuffle service, push-based shuffle, and 
dictionary encoding are all disabled (dictionary encoding requires all 
batches to be complete before writing).
   
   ### New I/O abstractions
   
   - **`FileSegmentsInputStream`** — `InputStream` over non-contiguous 
`(offset, size)` file segments; supports zero-copy native reads via 
`read(destAddress, maxSize)`
   - **`FileSegmentsManagedBuffer`** — `ManagedBuffer` backed by discontiguous 
segments; supports `nioByteBuffer()`, `createInputStream()`, `convertToNetty()`
   - **`DiscontiguousFileRegion`** — Netty `FileRegion` mapping a logical range 
to multiple physical segments for zero-copy network transfer
   - **`LowCopyFileSegmentsJniByteInputStream`** — zero-copy JNI wrapper over 
`FileSegmentsInputStream`; wired into `JniByteInputStreams.create()`
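
The core idea behind these abstractions, presenting non-contiguous `(offset, size)` segments as one logical byte stream, can be sketched in Python as follows. This is a hypothetical analogue for illustration only: the class name `SegmentsReader` is not part of the PR, and the sketch does not model the zero-copy native read path or the Netty integration.

```python
import io


class SegmentsReader(io.RawIOBase):
    """Reads a list of (offset, size) file segments as one logical stream."""

    def __init__(self, fileobj, segments):
        super().__init__()
        self._f = fileobj
        # Empty segments carry no data, so drop them up front.
        self._segments = [(off, size) for off, size in segments if size > 0]
        self._idx = 0         # index of the segment currently being read
        self._pos_in_seg = 0  # bytes already consumed from that segment

    def readable(self):
        return True

    def readinto(self, buf):
        if self._idx >= len(self._segments):
            return 0  # all segments consumed: end of stream
        offset, size = self._segments[self._idx]
        want = min(len(buf), size - self._pos_in_seg)
        self._f.seek(offset + self._pos_in_seg)
        data = self._f.read(want)
        n = len(data)
        buf[:n] = data
        self._pos_in_seg += n
        if self._pos_in_seg == size:
            # Current segment exhausted: continue with the next one.
            self._idx += 1
            self._pos_in_seg = 0
        return n
```

A single raw read never crosses a segment boundary in this sketch; wrapping the reader in `io.BufferedReader` hides that from callers that want exactly `n` bytes.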
   
   ### C++ `LocalPartitionWriter` changes
   
   - `usePartitionMultipleSegments_` flag + `partitionSegments_` vector 
tracking `(start, length)` per partition
   - `flushCachedPayloads()` — incremental flush after each `hashEvict`
   - `writeMemoryPayload()` — direct write to final data file during `sortEvict`
   - `writeIndexFile()` — serializes the new index at stop time
   - `PayloadCache::writeIncremental()` — flushes completed (non-active) 
partitions without touching the in-use partition
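
The bookkeeping these changes imply can be illustrated with a small Python model: each incremental flush appends partition payloads to the single data file and records a `(start, length)` pair per partition, so one partition may end up with several non-adjacent segments. This is a simplified sketch under assumptions, not the C++ code; `MultiSegmentWriter`, `flush`, and `read_partition` are hypothetical names, and compression, eviction policy, and the active-partition handling of `PayloadCache::writeIncremental()` are omitted.

```python
import io


class MultiSegmentWriter:
    """Appends payloads to one data file and tracks segments per partition."""

    def __init__(self, data_file, num_partitions):
        self._out = data_file
        self.partition_segments = [[] for _ in range(num_partitions)]

    def flush(self, payloads):
        """Write a {partition_id: bytes} batch straight to the data file."""
        for pid, payload in sorted(payloads.items()):
            if not payload:
                continue  # nothing to record for empty payloads
            start = self._out.tell()
            self._out.write(payload)
            self.partition_segments[pid].append((start, len(payload)))


def read_partition(data, segments):
    """Reassemble a partition by concatenating its segments in order."""
    return b"".join(data[start:start + length] for start, length in segments)
```

For example, flushing `{0: b"aa", 1: b"bbb"}` and then `{0: b"cc"}` leaves partition 0 with two segments separated by partition 1's data, which is exactly the case the multi-segment index has to describe.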
   
   ### JNI/JVM wiring
   
   `LocalPartitionWriterJniWrapper` and `JniWrapper.cc` accept a new optional 
`indexFile` parameter; `ColumnarShuffleWriter` passes the temp index file path 
when multi-segment mode is active.
   
   ## How was this patch tested?
   
   New unit test suites:
   - `ColumnarIndexShuffleBlockResolverSuite` — index format read/write, format 
detection, multi-segment block lookup
   - `FileSegmentsInputStreamSuite` — sequential reads, multi-segment 
traversal, skip, zero-copy native reads
   - `FileSegmentsManagedBufferSuite` — `nioByteBuffer`, `createInputStream`, 
`convertToNetty`, EOF and mmap edge cases
   - `DiscontiguousFileRegionSuite` — Netty transfer across discontiguous 
segments, lazy open
   - `LowCopyFileSegmentsJniByteInputStreamTest` — JNI wrapper correctness for 
ByteInputStream
   
   ## Was this patch authored or co-authored using generative AI tooling?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

