Re: [PR] [feature](routine-load) support fill_missing_columns for json format to auto-complete unspecified columns [doris]

via GitHub Fri, 26 Jun 2026 00:35:23 -0700


github-actions[bot] commented on code in PR #64071:
URL: https://github.com/apache/doris/pull/64071#discussion_r3479856151



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java:
##########
@@ -195,9 +196,29 @@ private void 
fillContextExprMap(List<NereidsImportColumnDesc> columnDescList, Ne
         // If user does not specify the file field names, generate it by using 
base schema of table.
         // So that the following process can be unified
         boolean specifyFileFieldNames = copiedColumnExprs.stream().anyMatch(p 
-> p.isColumn());
-        if (!specifyFileFieldNames) {
+        boolean fillMissing = isFillMissingColumns(fileGroup);
+        if (!specifyFileFieldNames || fillMissing) {
+            // The fill_missing_columns path may enter this branch even when 
the user already
+            // specified some plain file fields, so dedup the base-schema 
descriptors we add against
+            // columns that are already present as plain file fields (expr == 
null) to avoid emitting
+            // duplicate scan slots. We deliberately do NOT dedup mapping 
targets (expr != null): a
+            // mapping such as "k1 = k1" still needs the base scan descriptor 
to provide its source
+            // slot, and constant mappings are already skipped via 
constantMappingColumns below. In
+            // the legacy !specifyFileFieldNames path there are no plain file 
fields, so this set
+            // stays empty and that behavior remains byte-for-byte identical.
+            Set<String> existingColumns = new 
TreeSet<>(String.CASE_INSENSITIVE_ORDER);
+            if (fillMissing) {
+                for (NereidsImportColumnDesc desc : copiedColumnExprs) {

Review Comment:
   Please don't add a raw scan descriptor for a mapped destination unless the 
mapping expression actually reads that destination slot. With 
`fill_missing_columns`, `existingColumns` only contains plain file fields and 
`constantMappingColumns` skips only slotless expressions. For `COLUMNS(k1, v = 
k1 + 1)` into `(k1, v NOT NULL, w)` with JSON rows containing only `k1`, the 
base-schema loop appends a plain `v` descriptor even though the mapping 
computes `v` from `k1`. The FE plan then has raw source slots `[k1, v, w]`, but 
`NereidsLoadPlanInfoCollector.visitLogicalOneRowRelation()` skips default 
metadata for source slot `v` because `exprMap` contains `v`; BE 
`NewJsonReader::_fill_missing_column()` reports the missing non-null `v` field 
before `LogicalLoadProject` can produce `v = k1 + 1`. Same-name mappings like 
`v = v + 1` still need the raw source slot, but mappings that don't read `v` 
should not force the JSON field to exist.



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java:
##########
@@ -195,9 +196,29 @@ private void 
fillContextExprMap(List<NereidsImportColumnDesc> columnDescList, Ne
         // If user does not specify the file field names, generate it by using 
base schema of table.
         // So that the following process can be unified
         boolean specifyFileFieldNames = copiedColumnExprs.stream().anyMatch(p 
-> p.isColumn());
-        if (!specifyFileFieldNames) {
+        boolean fillMissing = isFillMissingColumns(fileGroup);
+        if (!specifyFileFieldNames || fillMissing) {
+            // The fill_missing_columns path may enter this branch even when 
the user already
+            // specified some plain file fields, so dedup the base-schema 
descriptors we add against
+            // columns that are already present as plain file fields (expr == 
null) to avoid emitting
+            // duplicate scan slots. We deliberately do NOT dedup mapping 
targets (expr != null): a
+            // mapping such as "k1 = k1" still needs the base scan descriptor 
to provide its source
+            // slot, and constant mappings are already skipped via 
constantMappingColumns below. In
+            // the legacy !specifyFileFieldNames path there are no plain file 
fields, so this set
+            // stays empty and that behavior remains byte-for-byte identical.
+            Set<String> existingColumns = new 
TreeSet<>(String.CASE_INSENSITIVE_ORDER);
+            if (fillMissing) {
+                for (NereidsImportColumnDesc desc : copiedColumnExprs) {
+                    if (desc.isColumn()) {
+                        existingColumns.add(desc.getColumnName());
+                    }
+                }
+            }
             List<Column> columns = tbl.getBaseSchema(false);
             for (Column column : columns) {
+                if (existingColumns.contains(column.getName())) {

Review Comment:
   Adding the plain descriptor for same-name mappings also needs to preserve 
the destination mapping expression. With `fill_missing_columns=true`, a 
self-referencing mapping such as `COLUMNS(k1, v = v + 1)` appends a plain `v` 
descriptor after the mapping descriptor. The later 
`columnExprMap.put(columnName, expr)` loop keeps only the last descriptor, so 
`v` maps to `null` and the schema loop no longer runs 
`transformHadoopFunctionExpr()`, HLL validation, or the `NEGATIVE` `SUM` 
rewrite for the mapped destination. The scan-slot loop later re-adds the 
original expression, but at that point those schema-loop transformations have 
already been skipped. Keep the raw source descriptor for same-name mappings, 
but do not let it erase the mapping expression used for destination-column 
processing.



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java:
##########
@@ -195,9 +196,29 @@ private void 
fillContextExprMap(List<NereidsImportColumnDesc> columnDescList, Ne
         // If user does not specify the file field names, generate it by using 
base schema of table.
         // So that the following process can be unified
         boolean specifyFileFieldNames = copiedColumnExprs.stream().anyMatch(p 
-> p.isColumn());
-        if (!specifyFileFieldNames) {
+        boolean fillMissing = isFillMissingColumns(fileGroup);
+        if (!specifyFileFieldNames || fillMissing) {

Review Comment:
   The fill-missing descriptors need to be inserted before the `COLUMNS FROM 
PATH` descriptors, or the scan params need an explicit path-slot 
classification. `NereidsDataDescription.analyzeColumns()` builds the source 
list as file fields followed by path columns, and 
`NereidsLoadPlanInfoCollector.toFileScanRangeParams()` still derives 
`numColumnsFromFile = srcSlotIds.size() - columnCountFromPath` and marks the 
last N slots as path-derived. With `fill_missing_columns=true`, this branch 
appends omitted base-schema descriptors after the existing list, so a plan like 
`COLUMNS(k1) COLUMNS FROM PATH AS (dt)` into `(k1, dt, v)` becomes source slots 
`[k1, dt(path), v(auto)]`; FE now marks `dt` as a file slot and `v` as the path 
suffix. That means the path value is not loaded into `dt`, and the auto-filled 
column can be interpreted as the path slot. Please preserve the invariant that 
path columns remain the final source slots when adding missing table columns.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [feature](routine-load) support fill_missing_columns for json format to auto-complete unspecified columns [doris]

Reply via email to