baiyangtx opened a new pull request, #7866:
URL: https://github.com/apache/paimon/pull/7866

   ### Purpose
   
   Manifest entries are currently written in arrival order, which scatters
   entries belonging to the same partition across multiple manifest files.
   This leads to:
   
   - Inefficient partition pruning during reads: the reader must scan many
     manifest files to find all entries for a single partition.
   - Suboptimal data locality when compaction reorganizes files.
   
   This PR introduces manifest entry sorting by partition, so that entries for
   the same partition are clustered together within manifest files.
   
   ### Changes
   
   **New configuration options:**
   
   | Option | Default | Description |
   |--------|---------|-------------|
   | `manifest.merge.sorted` | `true` | Sort entries by partition during manifest full compaction |
   | `manifest.merge.sort-on-commit` | `false` | Sort entries by partition during manifest full merge in commit |
   | `manifest.delta.sorted` | `true` | Sort entries by partition when writing manifest delta |
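
As a rough illustration of how these options might be set, the sketch below collects the three keys from the table into a plain string map. It assumes the options are supplied as ordinary string key/value table options; the class and method names are hypothetical, and how the map reaches the table (DDL `WITH` clause, catalog API, and so on) is left out. The only non-default value shown is `manifest.merge.sort-on-commit`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of this PR: gathers the new option keys in one place. */
final class ManifestSortOptionsSketch {

    /** Returns the three options from the table above, with sort-on-commit switched on. */
    static Map<String, String> manifestSortOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        // Default is already "true"; listed only to make the setting explicit.
        options.put("manifest.merge.sorted", "true");
        // Default is "false"; enabling it also sorts during the full merge in commit.
        options.put("manifest.merge.sort-on-commit", "true");
        // Default is already "true"; sorts entries when writing the manifest delta.
        options.put("manifest.delta.sorted", "true");
        return options;
    }

    public static void main(String[] args) {
        manifestSortOptions().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```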
   
   **Core implementation:**
   
   - `ManifestFileMerger`: Introduces `mergeSortedByPartition()` and `mergeUnsorted()`.
     When `manifest.merge.sorted` is enabled, entries are collected into a
     `BinaryExternalSortBuffer` (with spill-to-disk support), then written in
     partition-major order: `(partition, bucket, level)`.
   - `ManifestFileMerger.createManifestEntryComparator()`: Comparator used for
     sorting delta manifests, falling back to pure-Java comparison when codegen
     is unavailable.
   - `FileStoreCommitImpl`: Wires sort parameters into all three paths —
     commit manifest merge, delta file writing, and manifest compaction.
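
To make the `(partition, bucket, level)` ordering and its effect on pruning concrete, here is a toy model rather than the code in this PR. It assumes a simplified `Entry` class with a string partition (Paimon's real entries and partition rows are different types), sorts in memory instead of through `BinaryExternalSortBuffer`, and treats a "manifest file" as a fixed-size chunk of entries. All names in the sketch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of partition-major manifest sorting; not Paimon's actual classes. */
final class ManifestSortSketch {

    /** Hypothetical simplified entry: string partition plus bucket and level. */
    static final class Entry {
        final String partition;
        final int bucket;
        final int level;

        Entry(String partition, int bucket, int level) {
            this.partition = partition;
            this.bucket = bucket;
            this.level = level;
        }
    }

    /** Partition-major order: partition first, then bucket, then level. */
    static final Comparator<Entry> PARTITION_MAJOR =
            Comparator.comparing((Entry e) -> e.partition)
                    .thenComparingInt(e -> e.bucket)
                    .thenComparingInt(e -> e.level);

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Entry> sorted(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(PARTITION_MAJOR);
        return copy;
    }

    /** Splits entries into "manifest files" holding at most entriesPerFile entries. */
    static List<List<Entry>> chunk(List<Entry> entries, int entriesPerFile) {
        List<List<Entry>> files = new ArrayList<>();
        for (int i = 0; i < entries.size(); i += entriesPerFile) {
            int end = Math.min(i + entriesPerFile, entries.size());
            files.add(new ArrayList<>(entries.subList(i, end)));
        }
        return files;
    }

    /** Counts the files containing at least one entry for the given partition. */
    static long filesTouched(List<List<Entry>> files, String partition) {
        return files.stream()
                .filter(f -> f.stream().anyMatch(e -> e.partition.equals(partition)))
                .count();
    }

    /** Builds nine entries in arrival order, with three partitions interleaved. */
    static List<Entry> arrivalOrder() {
        String[] partitions = {"dt=2024-01-01", "dt=2024-01-02", "dt=2024-01-03"};
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < 9; i++) {
            entries.add(new Entry(partitions[i % 3], i % 2, 0));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> arrival = arrivalOrder();
        String target = "dt=2024-01-02";
        System.out.println("arrival order: " + filesTouched(chunk(arrival, 3), target));
        System.out.println("sorted:        " + filesTouched(chunk(sorted(arrival), 3), target));
    }
}
```

In this toy setup, a reader looking for `dt=2024-01-02` touches all three chunks in arrival order but only one chunk after sorting, which is the clustering effect the PR aims for.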
   
   ### Tests
   
   ```
   # Manifest merge with sorting
   mvn -pl paimon-core -am -DfailIfNoTests=false \
       -Dtest=ManifestFileMetaTest test
   
   # No-partition edge case
   mvn -pl paimon-core -am -DfailIfNoTests=false \
       -Dtest=NoPartitionManifestFileMetaTest test
   
   # Spark compact procedure
   mvn -pl paimon-spark/paimon-spark-ut -am -DfailIfNoTests=false \
       -Dtest=CompactProcedureTestBase test
   ```
   
   

