discivigour commented on PR #7842:
URL: https://github.com/apache/paimon/pull/7842#issuecomment-4485799665

   > Thanks for the update. I took another pass over the latest revision, and I 
think there are still a few issues that should be fixed before merging.
   > 
   > 1. The boundary condition for interval overlap still looks wrong.
   >    In `ManifestFileSorter.buildLevelSortedRuns`, a file is appended to an 
existing run when `file.min >= last.max`. In `splitIntoSections`, a new section 
is also started when `file.min >= sectionMaxBound`.
   >    However, partition stats represent closed intervals. For example, `[1, 
3]` and `[3, 5]` still overlap at partition value `3`. A sorted run is 
documented as containing non-overlapping intervals, so this case should not be 
placed into the same run. Similarly, sections are used as overlap-connected 
rewrite units, so this case should not be split into different sections either. 
I think both checks should use `> 0`, not `>= 0`, and we should add a test for 
the `max == min` boundary case.
   > 2. `manifest-sort.enabled` currently bypasses the original manifest 
compaction trigger/gate.
   >    `ManifestFileMerger.merge` directly enters 
`ManifestFileSorter.trySortRewrite` and returns from that path when manifest 
sort is enabled. This means the original 
`manifest.full-compaction-threshold-size` and `manifest.merge-min-count` 
behavior is no longer applied in the same way. Inside `classifyManifests`, 
files are classified only by `fileSize < targetSize` or delete-range overlap, 
so small manifests / delete manifests can trigger sort rewrite more 
aggressively than the existing merge logic.
   >    If this is intentional, I think the new semantics should be documented 
very clearly. Otherwise, the sort path should preserve the existing full/minor 
compaction gates, especially `manifest.merge-min-count` for minor manifest 
merging.
   > 3. The partial rewrite path for `manifest-sort.max-rewrite-size` can break 
the output order.
   >    When the first section exceeds the rewrite budget, `rewriteSections` 
splits it into `rewriteFiles` and `remainingFiles`, rewrites the first part, 
and appends the remaining section to the end of the `sections` list. If there 
are later sections with larger key ranges, the remaining part of the current 
section will be emitted after them, which can produce an order like `0..10, 
20..30, 10..20`.
   >    To keep the manifest list sorted, I think we should either skip the 
whole section once the budget is exceeded, or keep the remaining section at the 
current position/order instead of appending it to the tail. This also needs a 
regression test.
   > 4. Test coverage is still missing some important edge cases.
   >    The new tests cover large overlapping ranges and delete elimination, 
but I do not see coverage for boundary-touching intervals (`max == min`), 
`manifest.merge-min-count` / full threshold behavior under 
`manifest-sort.enabled`, `manifest-sort.max-rewrite-size` preserving global 
output order, or null partition values. The switch to `RecordComparator` is a 
good improvement, but a null partition test would make this safer.
   Thanks for your comment.
   1. I designed it this way to ensure that the minimum number of Sorted runs 
is built to reduce the burden of sorting.
   2. I have introduced manifestFullCompactionThresholdSize to to reduce the 
phenomenon of "one delete entry causing large-scale file rewriting.
   3. I think it is not a problem.
   4. I will add more tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to