github-actions[bot] commented on code in PR #64071:
URL: https://github.com/apache/doris/pull/64071#discussion_r3479856151
##########
fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java:
##########
@@ -195,9 +196,29 @@ private void
fillContextExprMap(List<NereidsImportColumnDesc> columnDescList, Ne
// If user does not specify the file field names, generate it by using
base schema of table.
// So that the following process can be unified
boolean specifyFileFieldNames = copiedColumnExprs.stream().anyMatch(p
-> p.isColumn());
- if (!specifyFileFieldNames) {
+ boolean fillMissing = isFillMissingColumns(fileGroup);
+ if (!specifyFileFieldNames || fillMissing) {
+ // The fill_missing_columns path may enter this branch even when
the user already
+ // specified some plain file fields, so dedup the base-schema
descriptors we add against
+ // columns that are already present as plain file fields (expr ==
null) to avoid emitting
+ // duplicate scan slots. We deliberately do NOT dedup mapping
targets (expr != null): a
+ // mapping such as "k1 = k1" still needs the base scan descriptor
to provide its source
+ // slot, and constant mappings are already skipped via
constantMappingColumns below. In
+ // the legacy !specifyFileFieldNames path there are no plain file
fields, so this set
+ // stays empty and that behavior remains byte-for-byte identical.
+ Set<String> existingColumns = new
TreeSet<>(String.CASE_INSENSITIVE_ORDER);
+ if (fillMissing) {
+ for (NereidsImportColumnDesc desc : copiedColumnExprs) {
Review Comment:
Please don't add a raw scan descriptor for a mapped destination unless the
mapping expression actually reads that destination slot. With
`fill_missing_columns`, `existingColumns` only contains plain file fields and
`constantMappingColumns` skips only slotless expressions. For `COLUMNS(k1, v =
k1 + 1)` into `(k1, v NOT NULL, w)` with JSON rows containing only `k1`, the
base-schema loop appends a plain `v` descriptor even though the mapping
computes `v` from `k1`. The FE plan then has raw source slots `[k1, v, w]`, but
`NereidsLoadPlanInfoCollector.visitLogicalOneRowRelation()` skips default
metadata for source slot `v` because `exprMap` contains `v`; BE
`NewJsonReader::_fill_missing_column()` reports the missing non-null `v` field
before `LogicalLoadProject` can produce `v = k1 + 1`. Same-name mappings like
`v = v + 1` still need the raw source slot, but mappings that don't read `v`
should not force the JSON field to exist.
##########
fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java:
##########
@@ -195,9 +196,29 @@ private void
fillContextExprMap(List<NereidsImportColumnDesc> columnDescList, Ne
// If user does not specify the file field names, generate it by using
base schema of table.
// So that the following process can be unified
boolean specifyFileFieldNames = copiedColumnExprs.stream().anyMatch(p
-> p.isColumn());
- if (!specifyFileFieldNames) {
+ boolean fillMissing = isFillMissingColumns(fileGroup);
+ if (!specifyFileFieldNames || fillMissing) {
+ // The fill_missing_columns path may enter this branch even when
the user already
+ // specified some plain file fields, so dedup the base-schema
descriptors we add against
+ // columns that are already present as plain file fields (expr ==
null) to avoid emitting
+ // duplicate scan slots. We deliberately do NOT dedup mapping
targets (expr != null): a
+ // mapping such as "k1 = k1" still needs the base scan descriptor
to provide its source
+ // slot, and constant mappings are already skipped via
constantMappingColumns below. In
+ // the legacy !specifyFileFieldNames path there are no plain file
fields, so this set
+ // stays empty and that behavior remains byte-for-byte identical.
+ Set<String> existingColumns = new
TreeSet<>(String.CASE_INSENSITIVE_ORDER);
+ if (fillMissing) {
+ for (NereidsImportColumnDesc desc : copiedColumnExprs) {
+ if (desc.isColumn()) {
+ existingColumns.add(desc.getColumnName());
+ }
+ }
+ }
List<Column> columns = tbl.getBaseSchema(false);
for (Column column : columns) {
+ if (existingColumns.contains(column.getName())) {
Review Comment:
Adding the plain descriptor for same-name mappings also needs to preserve
the destination mapping expression. With `fill_missing_columns=true`, a
self-referencing mapping such as `COLUMNS(k1, v = v + 1)` appends a plain `v`
descriptor after the mapping descriptor. The later
`columnExprMap.put(columnName, expr)` loop keeps only the last descriptor, so
`v` maps to `null` and the schema loop no longer runs
`transformHadoopFunctionExpr()`, HLL validation, or the `NEGATIVE` `SUM`
rewrite for the mapped destination. The scan-slot loop later re-adds the
original expression, but at that point those schema-loop transformations have
already been skipped. Keep the raw source descriptor for same-name mappings,
but do not let it erase the mapping expression used for destination-column
processing.
##########
fe/fe-core/src/main/java/org/apache/doris/nereids/load/NereidsLoadScanProvider.java:
##########
@@ -195,9 +196,29 @@ private void
fillContextExprMap(List<NereidsImportColumnDesc> columnDescList, Ne
// If user does not specify the file field names, generate it by using
base schema of table.
// So that the following process can be unified
boolean specifyFileFieldNames = copiedColumnExprs.stream().anyMatch(p
-> p.isColumn());
- if (!specifyFileFieldNames) {
+ boolean fillMissing = isFillMissingColumns(fileGroup);
+ if (!specifyFileFieldNames || fillMissing) {
Review Comment:
The fill-missing descriptors need to be inserted before the `COLUMNS FROM
PATH` descriptors, or the scan params need an explicit path-slot
classification. `NereidsDataDescription.analyzeColumns()` builds the source
list as file fields followed by path columns, and
`NereidsLoadPlanInfoCollector.toFileScanRangeParams()` still derives
`numColumnsFromFile = srcSlotIds.size() - columnCountFromPath` and marks the
last N slots as path-derived. With `fill_missing_columns=true`, this branch
appends omitted base-schema descriptors after the existing list, so a plan like
`COLUMNS(k1) COLUMNS FROM PATH AS (dt)` into `(k1, dt, v)` becomes source slots
`[k1, dt(path), v(auto)]`; FE now marks `dt` as a file slot and `v` as the path
suffix. That means the path value is not loaded into `dt`, and the auto-filled
column can be interpreted as the path slot. Please preserve the invariant that
path columns remain the final source slots when adding missing table columns.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]