Re: [PR] [core] Support snapshot-based sequence ordering for primary-key tables [paimon]

via GitHub Sun, 17 May 2026 23:03:04 -0700


leaves12138 commented on code in PR #7832:
URL: https://github.com/apache/paimon/pull/7832#discussion_r3256662080



##########
paimon-core/src/main/java/org/apache/paimon/schema/SchemaValidation.java:
##########
@@ -578,6 +582,19 @@ private static void validateFileIndex(TableSchema schema) {
         }
     }
 
+    private static void validateSnapshotSequenceOrdering(TableSchema schema, CoreOptions options) {
+        checkArgument(
+                !schema.primaryKeys().isEmpty(),
+                "%s = true requires a primary-key table; append-only tables cannot use "
+                        + "snapshot-based sequence ordering.",
+                CoreOptions.SEQUENCE_SNAPSHOT_ORDERING.key());
+        checkArgument(
+                options.sequenceField().isEmpty(),
+                "%s = true is mutually exclusive with %s; the snapshot id is the sole tiebreaker.",
+                CoreOptions.SEQUENCE_SNAPSHOT_ORDERING.key(),
+                CoreOptions.SEQUENCE_FIELD.key());
+    }

Review Comment:
   This option is currently accepted for every primary-key merge engine, but 
the implementation only preserves `snapshotId` for merge functions that return 
an input `KeyValue`. For example, `PartialUpdateMergeFunction` and 
`AggregateMergeFunction` build a new `KeyValue` via `replace(...)`, which 
resets `snapshotId` to `UNKNOWN_SNAPSHOT_ID`. During compaction, 
`stampSequenceWithSnapshotId` then writes `-1` into `_SEQUENCE_NUMBER` / file 
sequence metadata, so later reads can order compacted records incorrectly. 
Could you either restrict `sequence.snapshot-ordering` to the supported merge 
engine(s) here, or propagate the winning snapshot id through all merge 
functions and add tests for partial-update / aggregation?
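
A minimal sketch of the first option (restricting the option at validation time). It is not Paimon code: the `MergeEngine` enum, the `SUPPORTED` set, and the `validate` method are hypothetical stand-ins, and which engines belong in the supported set is an assumption made only for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

public class SnapshotOrderingValidationSketch {

    /** Hypothetical stand-in for the table's merge engine setting. */
    public enum MergeEngine { DEDUPLICATE, PARTIAL_UPDATE, AGGREGATE, FIRST_ROW }

    /**
     * Assumption for illustration: only engines whose merge function returns
     * one of its input records unchanged keep the snapshot id intact.
     */
    public static final Set<MergeEngine> SUPPORTED = EnumSet.of(MergeEngine.DEDUPLICATE);

    /** Rejects the option for engines outside the supported set. */
    public static void validate(boolean snapshotOrdering, MergeEngine engine) {
        if (snapshotOrdering && !SUPPORTED.contains(engine)) {
            throw new IllegalArgumentException(
                    String.format(
                            "sequence.snapshot-ordering = true is not supported for merge engine %s",
                            engine));
        }
    }

    public static void main(String[] args) {
        validate(true, MergeEngine.DEDUPLICATE); // accepted
        try {
            validate(true, MergeEngine.PARTIAL_UPDATE);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing fast here would keep unsupported engines from ever reaching the compaction path that writes `-1`.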



##########
paimon-api/src/main/java/org/apache/paimon/CoreOptions.java:
##########
@@ -965,6 +965,20 @@ public InlineElement getDescription() {
                     .defaultValue(SortOrder.ASCENDING)
                     .withDescription("Specify the order of sequence.field.");
 
+    @Immutable
+    public static final ConfigOption<Boolean> SEQUENCE_SNAPSHOT_ORDERING =
+            key("sequence.snapshot-ordering")
+                    .booleanType()
+                    .defaultValue(false)
+                    .withDescription(
+                            "When enabled, merge uses the commit snapshot id as the primary "

Review Comment:
   This option also looks unsafe to enable on a table that already has data 
written without the feature. Existing APPEND files have `minSequenceNumber` as 
the old sequence range, and existing COMPACT files have `_SEQUENCE_NUMBER` as 
the old per-record sequence number; after toggling this option on, readers will 
interpret those values as snapshot ids. Could this be documented and/or 
rejected for `ALTER TABLE` as a creation-only option? Otherwise an existing 
table can silently reorder old records.
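
To illustrate the hazard, here is a toy sketch with made-up numbers. The `Stored` class and `winner` method are hypothetical and do not correspond to Paimon classes; the values `1000` and `12` are invented for the example. It assumes the reader simply prefers the record with the larger stored number once that number is treated as a snapshot id.

```java
public class SequenceReinterpretationSketch {

    /** Hypothetical stand-in for a stored record; not a Paimon class. */
    public static final class Stored {
        public final String label;
        /** Whatever number was written into the sequence slot at write time. */
        public final long storedNumber;

        public Stored(String label, long storedNumber) {
            this.label = label;
            this.storedNumber = storedNumber;
        }
    }

    /** With the option on, the stored number is read as a snapshot id; larger wins. */
    public static Stored winner(Stored a, Stored b) {
        return a.storedNumber >= b.storedNumber ? a : b;
    }

    public static void main(String[] args) {
        // Written before the toggle: a per-record sequence number (invented value).
        Stored before = new Stored("written-before-toggle", 1_000L);
        // Written after the toggle: a snapshot id (invented value).
        Stored after = new Stored("written-after-toggle", 12L);
        // The older record wins because 1000 > 12 when both are read as snapshot ids.
        System.out.println(winner(before, after).label);
    }
}
```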



##########
paimon-core/src/main/java/org/apache/paimon/mergetree/compact/LookupMergeTreeCompactRewriter.java:
##########
@@ -217,7 +221,8 @@ public MergeFunctionWrapper<ChangelogResult> create(
                     valueEqualiser,
                     lookupStrategy,
                     deletionVectorsMaintainer,
-                    userDefinedSeqComparator);
+                    userDefinedSeqComparator,
+                    snapshotSequenceOrdering);

Review Comment:
   The lookup changelog path can still lose the snapshot id when 
`LookupMergeFunction` spills its `KeyValueBuffer` to the binary buffer. 
`KeyValueBuffer.createBinaryBuffer` still constructs `new 
KeyValueWithLevelNoReusingSerializer(keyType, valueType)` without 
`includeSnapshotId`, so after `lookup.merge-records-threshold` is exceeded, 
deserialized candidates have `UNKNOWN_SNAPSHOT_ID` and this comparator falls 
back to sequence-only ordering. Please thread `snapshotSequenceOrdering` into 
`KeyValueBuffer`'s serializer and add a test that forces lookup-buffer spill, 
for example with a very small `lookup.merge-records-threshold` and an 
`IOManager`.
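
A simplified sketch of the failure mode, using plain `java.io` streams rather than Paimon's serializers. The `Kv` class, the `roundTrip` method, and the `includeSnapshotId` flag as modeled here are hypothetical stand-ins; the real `KeyValueWithLevelNoReusingSerializer` may differ in its details.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SpillSerializerSketch {

    /** Assumed sentinel, matching the -1 mentioned earlier in this review. */
    public static final long UNKNOWN_SNAPSHOT_ID = -1L;

    /** Hypothetical stand-in for a key-value record; not a Paimon class. */
    public static final class Kv {
        public final long sequence;
        public final long snapshotId;

        public Kv(long sequence, long snapshotId) {
            this.sequence = sequence;
            this.snapshotId = snapshotId;
        }
    }

    /** Simulates a spill: write the record to bytes, then read it back. */
    public static Kv roundTrip(Kv kv, boolean includeSnapshotId) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeLong(kv.sequence);
                if (includeSnapshotId) {
                    out.writeLong(kv.snapshotId);
                }
            }
            try (DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                long sequence = in.readLong();
                // Without the flag, the snapshot id was never written, so it cannot be restored.
                long snapshotId = includeSnapshotId ? in.readLong() : UNKNOWN_SNAPSHOT_ID;
                return new Kv(sequence, snapshotId);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Kv original = new Kv(7L, 42L);
        System.out.println(roundTrip(original, false).snapshotId);
        System.out.println(roundTrip(original, true).snapshotId);
    }
}
```

A spill test along these lines would assert that the snapshot id survives the round trip once the flag is threaded through.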



