github-actions[bot] commented on code in PR #63192:
URL: https://github.com/apache/doris/pull/63192#discussion_r3253535327


##########
be/src/format/table/hive_reader.cpp:
##########
@@ -389,7 +389,7 @@ ColumnIdResult HiveParquetReader::_create_column_ids_by_top_level_col_index(
 
         // primitive (non-nested) types
         if ((slot->col_type() != TYPE_STRUCT && slot->col_type() != TYPE_ARRAY &&
-             slot->col_type() != TYPE_MAP)) {
+             slot->col_type() != TYPE_MAP && slot->col_type() != TYPE_VARIANT)) {
             column_ids.insert(field_schema->column_id);

Review Comment:
   Adding `TYPE_VARIANT` here routes VARIANT slots through
   `process_access_paths()`, using the `field_schema` selected above by
   `slot->col_pos()`. In the `hive_parquet_use_column_names=false` path,
   however, `slot->col_pos()` is the table ordinal, while the actual file
   ordinal is `get_scan_params().column_idxs[idx]` (the mapping already used
   when building `table_info_node` at lines 258-270).

   With schema evolution, such as a missing or dropped leading table column, a
   query on the first real VARIANT column can prune against the second Parquet
   field instead. The field ids needed for the requested VARIANT subpath are
   then not selected, so lazy materialization can return missing/null data or
   read the wrong column.

   Please map the table slot/column name through `column_idxs` before choosing
   the Parquet top-level field for by-position pruning, and add coverage for a
   position-based Hive Parquet scan with a non-identity `column_idxs` mapping
   and a VARIANT access path.
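
The mismatch described above can be reproduced with a minimal, standalone sketch. Everything in it is hypothetical and for illustration only: `FileField`, `Slot`, `pick_by_table_ordinal`, and `pick_by_file_ordinal` are not Doris types or functions, and it assumes that `column_idxs[idx]` holds the file ordinal for the `idx`-th requested slot, as the comment states. It needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the Doris types.
struct FileField {
    std::string name; // top-level Parquet field name
    int column_id;    // id that pruning would add to the selected column ids
};

struct Slot {
    std::string name;
    int col_pos; // table ordinal, as described in the review comment
};

// Problematic lookup: treats the table ordinal as if it were the file ordinal.
std::optional<FileField> pick_by_table_ordinal(const std::vector<FileField>& file_fields,
                                               const Slot& slot) {
    if (slot.col_pos < 0 ||
        static_cast<std::size_t>(slot.col_pos) >= file_fields.size()) {
        return std::nullopt;
    }
    return file_fields[static_cast<std::size_t>(slot.col_pos)];
}

// Suggested lookup: translate the idx-th requested slot through column_idxs
// (assumed here to map requested-slot index to file ordinal) before indexing
// into the file's top-level fields.
std::optional<FileField> pick_by_file_ordinal(const std::vector<FileField>& file_fields,
                                              const std::vector<int>& column_idxs,
                                              std::size_t idx) {
    if (idx >= column_idxs.size()) {
        return std::nullopt;
    }
    const int file_ordinal = column_idxs[idx];
    if (file_ordinal < 0 ||
        static_cast<std::size_t>(file_ordinal) >= file_fields.size()) {
        return std::nullopt;
    }
    return file_fields[static_cast<std::size_t>(file_ordinal)];
}
```

For example, take a table declared as (`dropped`, `v`, `w`) and a file that only contains (`v`, `w`). The slot for `v` has table ordinal 1 but file ordinal 0, so the first lookup lands on `w` while the second lands on `v`. A regression test along these lines would need a non-identity `column_idxs` mapping to exercise the difference.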



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

