Re: [PR] [feature](routine-load) support fill_missing_columns for json format to auto-complete unspecified columns [doris]

via GitHub Sat, 27 Jun 2026 19:31:31 -0700


github-actions[bot] commented on code in PR #64071:
URL: https://github.com/apache/doris/pull/64071#discussion_r3487233183



##########
fe/fe-core/src/main/java/org/apache/doris/datasource/property/fileformat/JsonFileFormatProperties.java:
##########
@@ -77,6 +79,9 @@ public void analyzeFileFormatProperties(Map<String, String> 
formatProperties, bo
             fuzzyParse = Boolean.valueOf(
                     getOrDefault(formatProperties, PROP_FUZZY_PARSE,
                             "", isRemoveOriginProperty)).booleanValue();
+            fillMissingColumns = Boolean.valueOf(

Review Comment:
   `CREATE ROUTINE LOAD` still does not validate the new boolean the way `ALTER 
ROUTINE LOAD` does. This parser uses `Boolean.valueOf(...)`, so a typo such as 
`"fill_missing_columns" = "treu"` is accepted and persisted as `false` instead 
of failing; `AlterRoutineLoadCommand` rejects the same value explicitly. Since 
this option controls whether missing columns are auto-filled, silently 
disabling it can leave the created job with different behavior than the user 
requested. Please add the same true/false validation on the create path, plus a 
create-path test for an invalid value.



##########
fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java:
##########
@@ -195,9 +197,31 @@ private void 
fillContextExprMap(List<NereidsImportColumnDesc> columnDescList, Ne
         // If user does not specify the file field names, generate it by using 
base schema of table.
         // So that the following process can be unified
         boolean specifyFileFieldNames = copiedColumnExprs.stream().anyMatch(p 
-> p.isColumn());
-        if (!specifyFileFieldNames) {
+        boolean fillMissing = isFillMissingColumns(fileGroup);
+        if (!specifyFileFieldNames || fillMissing) {
+            // Dedup the base-schema fill against descriptors that already 
provide the matching
+            // file slot, so the existing !specifyFileFieldNames behavior 
stays consistent.
+            // A descriptor only "provides the file slot" for a column when 
either:
+            //   - it is a true file field (expr == null), or
+            //   - it is a mapping that does NOT reference a source column 
with the same name as
+            //     its target (e.g. score_x2 = score * 2). Such a derived 
target consumes other
+            //     columns, so the base descriptor must be suppressed to keep 
the derivation.
+            // A self-referencing mapping such as COLUMNS(k1 = k1) still needs 
the scan to output
+            // its own source column k1, so it must NOT suppress the base 
descriptor; otherwise the
+            // scan would drop k1 while the mapping still references it.
+            Set<String> existingColumns = new 
TreeSet<>(String.CASE_INSENSITIVE_ORDER);
+            if (fillMissing) {
+                for (NereidsImportColumnDesc desc : copiedColumnExprs) {
+                    if (!mappingReferencesOwnColumn(desc)) {

Review Comment:
   This de-dup still misses mappings whose target-column dependency is 
introduced by the Hadoop-function rewrite. For example, with 
`fill_missing_columns=true` and `COLUMNS(score = replace_value(null, 0))`, 
`mappingReferencesOwnColumn()` sees no input slots, so `score` is added to 
`existingColumns` and the base-schema loop does not append the plain `score` 
scan descriptor. Later, `transformHadoopFunctionExpr()` rewrites 
`replace_value` into an `if` expression that reads `new 
UnboundSlot(columnName)`, so the final load tree is effectively 
`LogicalLoadProject(score := if(score is not null, score, 0))` over scan slots 
that no longer include `score`. Please treat `replace_value` as needing the raw 
target source slot, or run the dependency check after that compatibility 
rewrite, and add a regression case for `fill_missing_columns=true` with 
`replace_value`.



##########
regression-test/suites/load_p0/routine_load/test_routine_load_fill_missing_columns.groovy:
##########
@@ -0,0 +1,349 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+import org.apache.kafka.clients.producer.KafkaProducer
+import org.apache.kafka.clients.producer.ProducerRecord
+
+// End-to-end coverage for the json `fill_missing_columns` routine load option.
+// It verifies that a job which declares only a derived column in COLUMNS can 
still load
+// into a table that has a sequence column (the sequence column and other 
base-schema
+// columns are auto-filled), and that the same job with `fill_missing_columns` 
= false
+// keeps the original behavior and fails with "need to specify the sequence 
column".
+suite("test_routine_load_fill_missing_columns", "p0") {
+    String enabled = context.config.otherConfigs.get("enableKafkaTest")
+    String kafka_port = context.config.otherConfigs.get("kafka_port")
+    String externalEnvIp = context.config.otherConfigs.get("externalEnvIp")
+
+    if (enabled == null || !enabled.equalsIgnoreCase("true")) {
+        return
+    }
+
+    def dataFile = "test_routine_load_fill_missing_columns.json"
+    def topic = "test_routine_load_fill_missing_columns"

Review Comment:
   This test topic should be isolated per run. The suite always appends to the 
fixed topic `test_routine_load_fill_missing_columns`, and each routine-load job 
starts from `OFFSET_BEGINNING`; in the last case the table is `DUPLICATE KEY` 
and the test asserts exactly three rows. If the Kafka test environment keeps 
this topic from a prior run, the fresh job reads both the old and new messages 
and the final `rows.size() == 3` assertion can fail even though 
`fill_missing_columns` works. Please use a per-run topic name, or explicitly 
delete/recreate the topic before producing these records.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [feature](routine-load) support fill_missing_columns for json format to auto-complete unspecified columns [doris]

Reply via email to