github-actions[bot] commented on code in PR #64071:
URL: https://github.com/apache/doris/pull/64071#discussion_r3487233183
##########
fe/fe-core/src/main/java/org/apache/doris/datasource/property/fileformat/JsonFileFormatProperties.java:
##########
@@ -77,6 +79,9 @@ public void analyzeFileFormatProperties(Map<String, String>
formatProperties, bo
fuzzyParse = Boolean.valueOf(
getOrDefault(formatProperties, PROP_FUZZY_PARSE,
"", isRemoveOriginProperty)).booleanValue();
+ fillMissingColumns = Boolean.valueOf(
Review Comment:
`CREATE ROUTINE LOAD` still does not validate the new boolean the way `ALTER
ROUTINE LOAD` does. This parser uses `Boolean.valueOf(...)`, so a typo such as
`"fill_missing_columns" = "treu"` is accepted and persisted as `false` instead
of failing; `AlterRoutineLoadCommand` rejects the same value explicitly. Since
this option controls whether missing columns are auto-filled, silently
disabling it can leave the created job with different behavior than the user
requested. Please add the same true/false validation on the create path, plus a
create-path test for an invalid value.
##########
fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java:
##########
@@ -195,9 +197,31 @@ private void
fillContextExprMap(List<NereidsImportColumnDesc> columnDescList, Ne
// If user does not specify the file field names, generate it by using
base schema of table.
// So that the following process can be unified
boolean specifyFileFieldNames = copiedColumnExprs.stream().anyMatch(p
-> p.isColumn());
- if (!specifyFileFieldNames) {
+ boolean fillMissing = isFillMissingColumns(fileGroup);
+ if (!specifyFileFieldNames || fillMissing) {
+ // Dedup the base-schema fill against descriptors that already
provide the matching
+ // file slot, so the existing !specifyFileFieldNames behavior
stays consistent.
+ // A descriptor only "provides the file slot" for a column when
either:
+ // - it is a true file field (expr == null), or
+ // - it is a mapping that does NOT reference a source column
with the same name as
+ // its target (e.g. score_x2 = score * 2). Such a derived
target consumes other
+ // columns, so the base descriptor must be suppressed to keep
the derivation.
+ // A self-referencing mapping such as COLUMNS(k1 = k1) still needs
the scan to output
+ // its own source column k1, so it must NOT suppress the base
descriptor; otherwise the
+ // scan would drop k1 while the mapping still references it.
+ Set<String> existingColumns = new
TreeSet<>(String.CASE_INSENSITIVE_ORDER);
+ if (fillMissing) {
+ for (NereidsImportColumnDesc desc : copiedColumnExprs) {
+ if (!mappingReferencesOwnColumn(desc)) {
Review Comment:
This de-dup still misses mappings whose target-column dependency is
introduced by the Hadoop-function rewrite. For example, with
`fill_missing_columns=true` and `COLUMNS(score = replace_value(null, 0))`,
`mappingReferencesOwnColumn()` sees no input slots, so `score` is added to
`existingColumns` and the base-schema loop does not append the plain `score`
scan descriptor. Later, `transformHadoopFunctionExpr()` rewrites
`replace_value` into an `if` expression that reads `new
UnboundSlot(columnName)`, so the final load tree is effectively
`LogicalLoadProject(score := if(score is not null, score, 0))` over scan slots
that no longer include `score`. Please treat `replace_value` as needing the raw
target source slot, or run the dependency check after that compatibility
rewrite, and add a regression case for `fill_missing_columns=true` with
`replace_value`.
##########
regression-test/suites/load_p0/routine_load/test_routine_load_fill_missing_columns.groovy:
##########
@@ -0,0 +1,349 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+import org.apache.kafka.clients.producer.KafkaProducer
+import org.apache.kafka.clients.producer.ProducerRecord
+
+// End-to-end coverage for the json `fill_missing_columns` routine load option.
+// It verifies that a job which declares only a derived column in COLUMNS can
still load
+// into a table that has a sequence column (the sequence column and other
base-schema
+// columns are auto-filled), and that the same job with `fill_missing_columns`
= false
+// keeps the original behavior and fails with "need to specify the sequence
column".
+suite("test_routine_load_fill_missing_columns", "p0") {
+ String enabled = context.config.otherConfigs.get("enableKafkaTest")
+ String kafka_port = context.config.otherConfigs.get("kafka_port")
+ String externalEnvIp = context.config.otherConfigs.get("externalEnvIp")
+
+ if (enabled == null || !enabled.equalsIgnoreCase("true")) {
+ return
+ }
+
+ def dataFile = "test_routine_load_fill_missing_columns.json"
+ def topic = "test_routine_load_fill_missing_columns"
Review Comment:
This test topic should be isolated per run. The suite always appends to the
fixed topic `test_routine_load_fill_missing_columns`, and each routine-load job
starts from `OFFSET_BEGINNING`; in the last case the table is `DUPLICATE KEY`
and the test asserts exactly three rows. If the Kafka test environment keeps
this topic from a prior run, the fresh job reads both the old and new messages
and the final `rows.size() == 3` assertion can fail even though
`fill_missing_columns` works. Please use a per-run topic name, or explicitly
delete/recreate the topic before producing these records.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]