Re: [PR] [globalindex] Support multi-column GlobalIndex framework [paimon]

via GitHub Tue, 26 May 2026 18:32:17 -0700


jerry-024 commented on code in PR #7933:
URL: https://github.com/apache/paimon/pull/7933#discussion_r3307884833



##########
paimon-core/src/main/java/org/apache/paimon/table/source/FullTextScanImpl.java:
##########
@@ -61,7 +61,17 @@ public Plan scan() {
                     if (globalIndex == null) {
                         return false;
                     }
-                    return textColumn.id() == globalIndex.indexFieldId();
+                    if (textColumn.id() == globalIndex.indexFieldId()) {
+                        return true;
+                    }
+                    if (globalIndex.extraFieldIds() != null) {
+                        for (int id : globalIndex.extraFieldIds()) {
+                            if (textColumn.id() == id) {
+                                return true;
+                            }
+                        }

Review Comment:
   **High risk — read path mismatch**: `FullTextScanImpl` now correctly selects 
multi-column index files via the `extraFieldIds` check added in this PR. 
However, the corresponding **read** path in `FullTextReadImpl.java:72-74` still 
creates the reader with:
   ```java
   GlobalIndexerFactoryUtils.load(indexType).create(textColumn, options)
   ```
   
   This uses the single-column factory method. When the selected index file was 
built with `create(List<DataField>[text, vector], options)`, reading it with a 
single-column reader will either fail to decode the file or produce incorrect 
results.
   
   The scan/filter path and the read path are now inconsistent — scan discovers 
multi-column files, but read doesn't know how to open them. Need to use 
`resolveFields`-style metadata lookup + `create(List<DataField>, Options)` in 
the read path as well.



##########
paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/procedure/CreateGlobalIndexProcedure.java:
##########
@@ -97,7 +106,7 @@ public String[] call(
                 BTreeIndexTopoBuilder.buildIndexAndExecute(
                         procedureContext.getExecutionEnvironment(),
                         table,

Review Comment:
   Still unresolved: BTree and other index types that don't support 
multi-column will silently drop extra columns. When `indexColumns.size() > 1` 
and the factory doesn't override `create(List<DataField>, Options)`, the 
default implementation falls back to `fields.get(0)` without any error.
   
   Suggest: throw `UnsupportedOperationException` in the default 
`GlobalIndexerFactory.create(List<DataField>, Options)` when `fields.size() > 
1`, instead of silently falling back to `create(fields.get(0), options)`. This 
forces implementations to explicitly opt-in to multi-column support.



##########
paimon-core/src/main/java/org/apache/paimon/globalindex/GlobalIndexBuilderUtils.java:
##########
@@ -24,18 +24,26 @@
 import org.apache.paimon.index.GlobalIndexMeta;
 import org.apache.paimon.index.IndexFileMeta;
 import org.apache.paimon.index.IndexPathFactory;
+import org.apache.paimon.manifest.ManifestEntry;
 import org.apache.paimon.options.Options;
+import org.apache.paimon.schema.SchemaManager;
 import org.apache.paimon.table.FileStoreTable;
 import org.apache.paimon.types.DataField;
 import org.apache.paimon.utils.Range;
 
+import javax.annotation.Nullable;
+
 import java.io.IOException;
 import java.util.ArrayList;
+import java.util.HashMap;
 import java.util.List;
+import java.util.Map;

Review Comment:
   **High risk — MERGE path crash**: `MULTI_COLUMN_INDEX_FIELD_ID = -1` breaks 
existing code that calls `rowType.getField(globalIndexMeta.indexFieldId())` 
without guarding against -1:
   
   1. `MergeIntoUpdateChecker.java:104` (Flink): scans index manifest entries 
and does `rowType.getField(globalIndexMeta.indexFieldId())` — will throw when 
encountering a multi-column index.
   2. `MergeIntoPaimonDataEvolutionTable.scala:514` (Spark): same pattern — 
`rowType.getField(globalIndexMeta.indexFieldId()).name()`.
   
   Once a table has a multi-column global index, any `MERGE INTO` that touches 
indexed columns will crash with "Cannot find field by field id: -1".
   
   Fix: these callers need to handle `MULTI_COLUMN_INDEX_FIELD_ID` by reading 
`extraFieldIds()` to get the actual column list.



##########
paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/globalindex/GenericIndexTopoBuilder.java:
##########
@@ -626,16 +648,35 @@ public void processElement(StreamRecord<ShardTask> 
element) throws Exception {
                         }
                         // Only write rows within this shard's range
                         if (currentRowId >= task.shardRange.from) {
-                            Object fieldData = 
indexFieldGetter.getFieldOrNull(row);
-                            if (fieldData == null) {
-                                LOG.info(
-                                        "Null vector at rowId={}, stopping 
shard [{}, {}].",
-                                        currentRowId,
-                                        task.shardRange.from,
-                                        task.shardRange.to);
-                                break;
+                            if (multiColumn) {
+                                boolean hasNull = false;
+                                for (InternalRow.FieldGetter getter : 
indexFieldGetters) {
+                                    if (getter.getFieldOrNull(row) == null) {
+                                        hasNull = true;
+                                        break;
+                                    }
+                                }
+                                if (hasNull) {
+                                    LOG.info(
+                                            "Null value in indexed columns at 
rowId={}, stopping shard [{}, {}].",
+                                            currentRowId,
+                                            task.shardRange.from,
+                                            task.shardRange.to);
+                                    break;
+                                }
+                                ((GlobalIndexMultiColumnWriter) 
indexWriter).write(row);
+                            } else {
+                                Object fieldData = 
indexFieldGetters[0].getFieldOrNull(row);
+                                if (fieldData == null) {
+                                    LOG.info(

Review Comment:
   Still unresolved from last review: multi-column writer receives the full 
projected row including `_ROW_ID`. Need to project down to index-only columns 
before `write(row)`. See `ProjectedRow` approach.



##########
paimon-core/src/main/java/org/apache/paimon/table/source/VectorScanImpl.java:
##########
@@ -82,7 +82,17 @@ public Plan scan() {
                         return false;
                     }
                     int fieldId = globalIndex.indexFieldId();
-                    return vectorColumn.id() == fieldId || 
filterFieldIds.contains(fieldId);
+                    if (vectorColumn.id() == fieldId || 
filterFieldIds.contains(fieldId)) {
+                        return true;
+                    }
+                    if (globalIndex.extraFieldIds() != null) {
+                        for (int id : globalIndex.extraFieldIds()) {
+                            if (vectorColumn.id() == id || 
filterFieldIds.contains(id)) {
+                                return true;
+                            }

Review Comment:
   Same read path issue as FullText: `VectorReadImpl.java:87-89` creates the 
reader with `create(vectorColumn, options)` (single-column), but this scan 
filter now discovers multi-column index files. The read side needs the same 
multi-column awareness.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [globalindex] Support multi-column GlobalIndex framework [paimon]

Reply via email to