baiyangtx opened a new pull request, #7866:
URL: https://github.com/apache/paimon/pull/7866
### Purpose
Manifest entries are currently written in arrival order, which scatters entries belonging to the same partition across multiple manifest files. This leads to:
- Inefficient partition pruning during reads: the reader must scan many manifest files to find all entries for a single partition.
- Suboptimal data locality when compaction reorganizes files.
This PR introduces manifest entry sorting by partition, so that entries for the same partition are clustered together within manifest files.
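
To make the motivation concrete, here is a toy sketch (not code from this PR) that compares how many manifest files contain a given partition before and after sorting. It assumes a fixed number of entries per manifest file and represents each entry only by its partition name. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ManifestClusteringSketch {

    /** Splits entries into fixed-size chunks; each chunk stands in for one manifest file. */
    static List<List<String>> chunk(List<String> partitions, int entriesPerFile) {
        List<List<String>> files = new ArrayList<>();
        for (int i = 0; i < partitions.size(); i += entriesPerFile) {
            int end = Math.min(i + entriesPerFile, partitions.size());
            files.add(partitions.subList(i, end));
        }
        return files;
    }

    /** Counts the simulated manifest files holding at least one entry of the partition. */
    static int filesTouched(List<List<String>> files, String partition) {
        int count = 0;
        for (List<String> file : files) {
            if (file.contains(partition)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Arrival order interleaves three partitions.
        List<String> arrival =
                Arrays.asList("p1", "p2", "p3", "p1", "p2", "p3", "p1", "p2", "p3");

        // Sorting by partition clusters equal partitions together.
        List<String> sorted = new ArrayList<>(arrival);
        sorted.sort(Comparator.naturalOrder());

        System.out.println(filesTouched(chunk(arrival, 3), "p1")); // prints 3
        System.out.println(filesTouched(chunk(sorted, 3), "p1"));  // prints 1
    }
}
```

In this toy setup a reader looking for `p1` opens three files in arrival order but only one after sorting. Real manifest files are sized by bytes rather than entry count, so the actual gain depends on the workload.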
### Changes
**New configuration options:**
| Option | Default | Description |
|--------|---------|-------------|
| `manifest.merge.sorted` | `true` | Sort entries by partition during manifest full compaction |
| `manifest.merge.sort-on-commit` | `false` | Sort entries by partition during manifest full merge in commit |
| `manifest.delta.sorted` | `true` | Sort entries by partition when writing manifest delta |
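
As a rough illustration, the three options could be collected as plain string key-value pairs. The keys and defaults come from the table above. The class name is hypothetical, and how the map reaches the table (for example as table options at creation time) is an assumption not covered by this description. The sketch opts in to `manifest.merge.sort-on-commit`, which is off by default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ManifestSortOptionsSketch {

    /** Builds the option map; keys are taken from the PR's option table. */
    static Map<String, String> manifestSortOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        // Default is true: sort during manifest full compaction.
        options.put("manifest.merge.sorted", "true");
        // Default is false: opt in to sorting during the full merge in commit.
        options.put("manifest.merge.sort-on-commit", "true");
        // Default is true: sort when writing the manifest delta.
        options.put("manifest.delta.sorted", "true");
        return options;
    }

    public static void main(String[] args) {
        manifestSortOptions().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```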
**Core implementation:**
- `ManifestFileMerger`: Introduces `mergeSortedByPartition()` and `mergeUnsorted()`. When `manifest.merge.sorted` is enabled, entries are collected into a `BinaryExternalSortBuffer` (with spill-to-disk support), then written in partition-major order: `(partition, bucket, level)`.
- `ManifestFileMerger.createManifestEntryComparator()`: Comparator used for sorting delta manifests, falling back to pure-Java comparison when codegen is unavailable.
- `FileStoreCommitImpl`: Wires sort parameters into all three paths: commit manifest merge, delta file writing, and manifest compaction.
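
The partition-major order `(partition, bucket, level)` can be sketched with a plain Java comparator. This is an illustration only, not the PR's `createManifestEntryComparator()`. The `Entry` class is a hypothetical stand-in that models the partition as a string, whereas the real entries carry a binary partition row and are sorted through `BinaryExternalSortBuffer`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PartitionMajorOrderSketch {

    /** Hypothetical, simplified manifest entry used only for this illustration. */
    static final class Entry {
        final String partition;
        final int bucket;
        final int level;
        final String fileName;

        Entry(String partition, int bucket, int level, String fileName) {
            this.partition = partition;
            this.bucket = bucket;
            this.level = level;
            this.fileName = fileName;
        }
    }

    /** Compares by partition first, then bucket, then level. */
    static final Comparator<Entry> PARTITION_MAJOR =
            Comparator.comparing((Entry e) -> e.partition)
                    .thenComparingInt(e -> e.bucket)
                    .thenComparingInt(e -> e.level);

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<Entry> sortEntries(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(PARTITION_MAJOR);
        return sorted;
    }

    public static void main(String[] args) {
        List<Entry> arrival = Arrays.asList(
                new Entry("dt=2024-01-02", 0, 0, "a"),
                new Entry("dt=2024-01-01", 1, 0, "b"),
                new Entry("dt=2024-01-01", 0, 5, "c"),
                new Entry("dt=2024-01-01", 0, 0, "d"));

        // Prints the file names in the order d, c, b, a.
        for (Entry e : sortEntries(arrival)) {
            System.out.println(e.partition + " bucket=" + e.bucket
                    + " level=" + e.level + " file=" + e.fileName);
        }
    }
}
```

An in-memory `List.sort` is used here for brevity. A spill-capable buffer, as the PR describes, matters when the number of entries in a full merge exceeds what fits comfortably in memory.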
### Tests
```
# Manifest merge with sorting
mvn -pl paimon-core -am -DfailIfNoTests=false \
-Dtest=ManifestFileMetaTest test
# No-partition edge case
mvn -pl paimon-core -am -DfailIfNoTests=false \
-Dtest=NoPartitionManifestFileMetaTest test
# Spark compact procedure
mvn -pl paimon-spark/paimon-spark-ut -am -DfailIfNoTests=false \
-Dtest=CompactProcedureTestBase test
```