github-actions[bot] commented on code in PR #63192:
URL: https://github.com/apache/doris/pull/63192#discussion_r3233406941


##########
be/src/format/parquet/vparquet_reader.cpp:
##########
@@ -432,11 +435,111 @@ Status 
ParquetReader::on_before_init_reader(ReaderInitContext* ctx) {
         RETURN_IF_ERROR(get_file_metadata_schema(&field_desc));
         
RETURN_IF_ERROR(TableSchemaChangeHelper::BuildTableInfoUtil::by_parquet_name(
                 ctx->tuple_descriptor, *field_desc, ctx->table_info_node));
+        auto column_id_result = _create_column_ids_by_name(field_desc, 
ctx->tuple_descriptor);
+        ctx->column_ids = std::move(column_id_result.column_ids);
+        ctx->filter_column_ids = std::move(column_id_result.filter_column_ids);
     }
 
     return Status::OK();
 }
 
+ColumnIdResult ParquetReader::_create_column_ids_by_name(const 
FieldDescriptor* field_desc,
+                                                         const 
TupleDescriptor* tuple_descriptor) {
+    auto* mutable_field_desc = const_cast<FieldDescriptor*>(field_desc);
+    mutable_field_desc->assign_ids();
+
+    std::unordered_map<std::string, const FieldSchema*> 
table_col_name_to_field_schema_map;
+    for (int i = 0; i < field_desc->size(); ++i) {
+        auto field_schema = field_desc->get_column(i);
+        if (!field_schema) {
+            continue;
+        }
+        table_col_name_to_field_schema_map[field_schema->lower_case_name] = 
field_schema;
+    }
+
+    std::set<uint64_t> column_ids;
+    std::set<uint64_t> filter_column_ids;
+
+    auto process_access_paths = [](const FieldSchema* parquet_field,
+                                   const std::vector<TColumnAccessPath>& 
access_paths,
+                                   std::set<uint64_t>& out_ids) {
+        process_nested_access_paths(
+                parquet_field, access_paths, out_ids,
+                [](const FieldSchema* field) { return field->get_column_id(); 
},
+                [](const FieldSchema* field) { return 
field->get_max_column_id(); },
+                HiveParquetNestedColumnUtils::extract_nested_column_ids);
+    };
+
+    for (const auto* slot : tuple_descriptor->slots()) {
+        auto it = 
table_col_name_to_field_schema_map.find(slot->col_name_lower_case());
+        if (it == table_col_name_to_field_schema_map.end()) {
+            continue;
+        }
+        auto field_schema = it->second;
+

Review Comment:
   This branch now treats `TYPE_VARIANT` like a nested-pruned complex type, but 
a full VARIANT projection has no access path: `AccessPathExpressionCollector` 
only records variant paths when a subcolumn is accessed. In a query like 
`SELECT id, v FROM local(...)`, the scalar `id` inserts one column id, so 
`_column_ids` is non-empty; the full `v` slot has empty `all_access_paths()`, 
`process_access_paths()` inserts nothing for it, and the downstream 
readers/profile collection prune all `v` leaves. Please add the full variant 
field range when `all_access_paths()` is empty (or make FE emit an explicit 
root access path) and add coverage for selecting the full VARIANT column 
together with another column. This is distinct from the existing shredded 
subpath threads because it affects full-slot projection rather than 
residual/typed-value reconstruction.



##########
be/src/format/table/hive_reader.cpp:
##########
@@ -328,7 +328,7 @@ ColumnIdResult HiveParquetReader::_create_column_ids(const 
FieldDescriptor* fiel
 
         // primitive (non-nested) types
         if ((slot->col_type() != TYPE_STRUCT && slot->col_type() != TYPE_ARRAY 
&&

Review Comment:
   Same full-slot pruning gap in the Hive Parquet path: after adding 
`TYPE_VARIANT` to the complex branch, a selected VARIANT column with no subpath 
contributes no column ids. If another projected/predicate column makes 
`column_ids` non-empty, the Hive Parquet reader will skip the entire VARIANT 
payload for queries such as `SELECT id, v FROM hive_table`. Please select the 
full field range when `all_access_paths()` is empty, and apply the same fix to 
the top-level-column-index helper below. This is a separate reader path from 
the standalone local Parquet path.



##########
be/src/format/table/iceberg_reader.cpp:
##########
@@ -344,7 +344,7 @@ ColumnIdResult 
IcebergParquetReader::_create_column_ids(const FieldDescriptor* f
         auto field_schema = it->second;
 
         if ((slot->col_type() != TYPE_STRUCT && slot->col_type() != TYPE_ARRAY 
&&
-             slot->col_type() != TYPE_MAP)) {
+             slot->col_type() != TYPE_MAP && slot->col_type() != 
TYPE_VARIANT)) {

Review Comment:
   The Iceberg path has the same full VARIANT projection issue. A full slot 
read has empty `all_access_paths()`, so once `TYPE_VARIANT` bypasses the 
primitive branch it adds no ids for `v`; with any other selected Iceberg 
column, `column_ids` is non-empty and all VARIANT leaves are pruned. Please add 
the full field range for empty access paths (or emit a root access path from 
FE) and cover `SELECT id, v` on an Iceberg VARIANT file. This is distinct from 
the existing Iceberg subpath/residual pruning threads because no VARIANT 
subpath is requested here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to