Re: [PR] [spark] Reject ALTER TABLE REPLACE COLUMNS to avoid silent data corruption [paimon]

via GitHub Tue, 16 Jun 2026 01:54:41 -0700


JingsongLi commented on code in PR #8246:
URL: https://github.com/apache/paimon/pull/8246#discussion_r3419366461



##########
paimon-spark/paimon-spark-common/src/main/java/org/apache/paimon/spark/SparkCatalog.java:
##########
@@ -362,6 +369,36 @@ public org.apache.spark.sql.connector.catalog.Table alterTable(
         }
     }
 
+    /**
+     * Detects whether the given changes originate from an {@code ALTER TABLE ... REPLACE COLUMNS}
+     * statement.
+     *
+     * <p>Spark translates {@code REPLACE COLUMNS} into a batch that drops every existing column and
+     * re-adds the new set, i.e. a combination of {@link TableChange.DeleteColumn} and {@link
+     * TableChange.AddColumn} only. Other column changes such as rename or type update are never
+     * produced by {@code REPLACE COLUMNS}, so we match exclusively on these two types to avoid
+     * mistaking a legitimate mixed batch (e.g. a programmatic DROP + RENAME) for a replace.
+     *
+     * <p>This operation must be rejected because re-adding columns assigns brand-new field ids
+     * while existing data files keep the old ids; same-named columns would then be treated as new
+     * columns and read back as null, silently corrupting data.
+     */
+    private boolean isReplaceColumns(TableChange[] changes) {
+        boolean hasDeleteColumn = false;
+        boolean hasAddColumn = false;
+        for (TableChange change : changes) {
+            if (change instanceof TableChange.DeleteColumn) {
+                hasDeleteColumn = true;
+            } else if (change instanceof TableChange.AddColumn) {
+                hasAddColumn = true;
+            } else {
+                return false;
+            }
+        }
+
+        return hasDeleteColumn && hasAddColumn;

Review Comment:
   This heuristic also rejects any programmatic `TableCatalog.alterTable` call 
that batches a supported drop and add together, for example `deleteColumn("b")` 
plus `addColumn("d", ...)`. That is not necessarily `ALTER TABLE ... REPLACE 
COLUMNS` and it used to be a valid combination of existing schema changes. Can 
we make the detection narrower, e.g. only reject Spark's replace pattern where 
all current top-level columns are deleted before the new columns are added, or 
otherwise avoid blocking ordinary drop+add batches?
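
   A minimal sketch of the narrower check suggested above, not the PR's actual 
   code. To stay self-contained it uses hypothetical stand-in types (`Change`, 
   `DeleteColumn`, `AddColumn`, `RenameColumn`) in place of Spark's 
   `TableChange` classes, and it takes the table's current top-level column 
   names as an explicit parameter. It assumes that a replace emits all deletes 
   before any add, and that column names compare case-sensitively; both 
   assumptions would need checking against Spark's real behavior.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReplaceColumnsDetection {

    // Hypothetical stand-ins for Spark's TableChange subtypes (illustration only).
    interface Change {}

    static final class DeleteColumn implements Change {
        final String[] fieldNames; // path to the column; length 1 means top-level
        DeleteColumn(String... fieldNames) { this.fieldNames = fieldNames; }
    }

    static final class AddColumn implements Change {
        final String[] fieldNames;
        AddColumn(String... fieldNames) { this.fieldNames = fieldNames; }
    }

    static final class RenameColumn implements Change {}

    /**
     * Returns true only when the batch deletes every current top-level column
     * and then adds columns, i.e. the shape the review comment attributes to
     * REPLACE COLUMNS. An ordinary drop + add batch returns false.
     */
    static boolean isReplaceColumns(List<String> currentTopLevelColumns, List<Change> changes) {
        Set<String> deleted = new HashSet<>();
        boolean seenAdd = false;
        for (Change change : changes) {
            if (change instanceof DeleteColumn) {
                String[] names = ((DeleteColumn) change).fieldNames;
                // Assumption: a replace never deletes after an add, and never
                // targets a nested field.
                if (seenAdd || names.length != 1) {
                    return false;
                }
                deleted.add(names[0]);
            } else if (change instanceof AddColumn) {
                seenAdd = true;
            } else {
                // Rename, type update, etc. are not part of a replace.
                return false;
            }
        }
        // Assumption: exact, case-sensitive match against the current schema.
        return seenAdd
                && !currentTopLevelColumns.isEmpty()
                && deleted.equals(new HashSet<>(currentTopLevelColumns));
    }

    public static void main(String[] args) {
        List<String> current = Arrays.asList("a", "b", "c");

        // Full replace shape: every existing column dropped, then the new set added.
        List<Change> replace = Arrays.asList(
                new DeleteColumn("a"), new DeleteColumn("b"), new DeleteColumn("c"),
                new AddColumn("a"), new AddColumn("d"));

        // Ordinary batch from the review comment: drop "b", add "d".
        List<Change> dropAndAdd = Arrays.asList(new DeleteColumn("b"), new AddColumn("d"));

        System.out.println(isReplaceColumns(current, replace));
        System.out.println(isReplaceColumns(current, dropAndAdd));
    }
}
```

   With this shape, `deleteColumn("b")` plus `addColumn("d", ...)` is let 
   through, while a batch that drops all of `a`, `b`, `c` and then adds columns 
   is flagged. A caller that deliberately drops every column and re-adds some 
   in one batch would still be rejected, which matches the corruption scenario 
   described in the Javadoc.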



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
