jerry-024 commented on code in PR #7933:
URL: https://github.com/apache/paimon/pull/7933#discussion_r3307884833
##########
paimon-core/src/main/java/org/apache/paimon/table/source/FullTextScanImpl.java:
##########
@@ -61,7 +61,17 @@ public Plan scan() {
if (globalIndex == null) {
return false;
}
- return textColumn.id() == globalIndex.indexFieldId();
+ if (textColumn.id() == globalIndex.indexFieldId()) {
+ return true;
+ }
+ if (globalIndex.extraFieldIds() != null) {
+ for (int id : globalIndex.extraFieldIds()) {
+ if (textColumn.id() == id) {
+ return true;
+ }
+ }
Review Comment:
**High risk — read path mismatch**: `FullTextScanImpl` now correctly selects
multi-column index files via the `extraFieldIds` check added in this PR.
However, the corresponding **read** path in `FullTextReadImpl.java:72-74` still
creates the reader with:
```java
GlobalIndexerFactoryUtils.load(indexType).create(textColumn, options)
```
This uses the single-column factory method. When the selected index file was
built with `create(List<DataField>[text, vector], options)`, reading it with a
single-column reader will either fail to decode the file or produce incorrect
results.
The scan/filter path and the read path are now inconsistent — scan discovers
multi-column files, but read doesn't know how to open them. Need to use
`resolveFields`-style metadata lookup + `create(List<DataField>, Options)` in
the read path as well.
##########
paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/procedure/CreateGlobalIndexProcedure.java:
##########
@@ -97,7 +106,7 @@ public String[] call(
BTreeIndexTopoBuilder.buildIndexAndExecute(
procedureContext.getExecutionEnvironment(),
table,
Review Comment:
Still unresolved: BTree and other index types that don't support
multi-column will silently drop extra columns. When `indexColumns.size() > 1`
and the factory doesn't override `create(List<DataField>, Options)`, the
default implementation falls back to `fields.get(0)` without any error.
Suggest: throw `UnsupportedOperationException` in the default
`GlobalIndexerFactory.create(List<DataField>, Options)` when `fields.size() >
1`, instead of silently falling back to `create(fields.get(0), options)`. This
forces implementations to explicitly opt-in to multi-column support.
##########
paimon-core/src/main/java/org/apache/paimon/globalindex/GlobalIndexBuilderUtils.java:
##########
@@ -24,18 +24,26 @@
import org.apache.paimon.index.GlobalIndexMeta;
import org.apache.paimon.index.IndexFileMeta;
import org.apache.paimon.index.IndexPathFactory;
+import org.apache.paimon.manifest.ManifestEntry;
import org.apache.paimon.options.Options;
+import org.apache.paimon.schema.SchemaManager;
import org.apache.paimon.table.FileStoreTable;
import org.apache.paimon.types.DataField;
import org.apache.paimon.utils.Range;
+import javax.annotation.Nullable;
+
import java.io.IOException;
import java.util.ArrayList;
+import java.util.HashMap;
import java.util.List;
+import java.util.Map;
Review Comment:
**High risk — MERGE path crash**: `MULTI_COLUMN_INDEX_FIELD_ID = -1` breaks
existing code that calls `rowType.getField(globalIndexMeta.indexFieldId())`
without guarding against -1:
1. `MergeIntoUpdateChecker.java:104` (Flink): scans index manifest entries
and does `rowType.getField(globalIndexMeta.indexFieldId())` — will throw when
encountering a multi-column index.
2. `MergeIntoPaimonDataEvolutionTable.scala:514` (Spark): same pattern —
`rowType.getField(globalIndexMeta.indexFieldId()).name()`.
Once a table has a multi-column global index, any `MERGE INTO` that touches
indexed columns will crash with "Cannot find field by field id: -1".
Fix: these callers need to handle `MULTI_COLUMN_INDEX_FIELD_ID` by reading
`extraFieldIds()` to get the actual column list.
##########
paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/globalindex/GenericIndexTopoBuilder.java:
##########
@@ -626,16 +648,35 @@ public void processElement(StreamRecord<ShardTask>
element) throws Exception {
}
// Only write rows within this shard's range
if (currentRowId >= task.shardRange.from) {
- Object fieldData =
indexFieldGetter.getFieldOrNull(row);
- if (fieldData == null) {
- LOG.info(
- "Null vector at rowId={}, stopping
shard [{}, {}].",
- currentRowId,
- task.shardRange.from,
- task.shardRange.to);
- break;
+ if (multiColumn) {
+ boolean hasNull = false;
+ for (InternalRow.FieldGetter getter :
indexFieldGetters) {
+ if (getter.getFieldOrNull(row) == null) {
+ hasNull = true;
+ break;
+ }
+ }
+ if (hasNull) {
+ LOG.info(
+ "Null value in indexed columns at
rowId={}, stopping shard [{}, {}].",
+ currentRowId,
+ task.shardRange.from,
+ task.shardRange.to);
+ break;
+ }
+ ((GlobalIndexMultiColumnWriter)
indexWriter).write(row);
+ } else {
+ Object fieldData =
indexFieldGetters[0].getFieldOrNull(row);
+ if (fieldData == null) {
+ LOG.info(
Review Comment:
Still unresolved from last review: multi-column writer receives the full
projected row including `_ROW_ID`. Need to project down to index-only columns
before `write(row)`. See `ProjectedRow` approach.
##########
paimon-core/src/main/java/org/apache/paimon/table/source/VectorScanImpl.java:
##########
@@ -82,7 +82,17 @@ public Plan scan() {
return false;
}
int fieldId = globalIndex.indexFieldId();
- return vectorColumn.id() == fieldId ||
filterFieldIds.contains(fieldId);
+ if (vectorColumn.id() == fieldId ||
filterFieldIds.contains(fieldId)) {
+ return true;
+ }
+ if (globalIndex.extraFieldIds() != null) {
+ for (int id : globalIndex.extraFieldIds()) {
+ if (vectorColumn.id() == id ||
filterFieldIds.contains(id)) {
+ return true;
+ }
Review Comment:
Same read path issue as FullText: `VectorReadImpl.java:87-89` creates the
reader with `create(vectorColumn, options)` (single-column), but this scan
filter now discovers multi-column index files. The read side needs the same
multi-column awareness.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]