[Impala-ASF-CR] IMPALA-9874: Skip IO for late materialized columns

Xuebin Su (Code Review) Tue, 08 Jul 2025 02:46:07 -0700

Xuebin Su has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23012 )


Change subject: IMPALA-9874: Skip IO for late materialized columns
......................................................................


Patch Set 8:

(5 comments)

> Patch Set 6:
>
> (5 comments)

Thanks for reviewing!

http://gerrit.cloudera.org:8080/#/c/23012/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23012/6//COMMIT_MSG@7
PS6, Line 7: IMPALA-9874: Skip IO for late materialized columns
           :
           : Previously, IO skipping only worked with statistics and dictionarie
> The scope is not clear - did we skip IO for skipped non-filter data pages e
Thanks! Added a description in the commit message.


http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/hdfs-parquet-scanner.h@484
PS6, Line 484: s to 'non_dict_filterable_filter_reader
> The name looks redundant to me, as non-filter reader is never dict_filterab
Thanks! Changed.


http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/hdfs-parquet-scanner.h@878
PS6, Line 878:   /// are flattened into complex_readers_. All scalar readers 
are also flattened into
             :   /// scalar_readers_.
             :   /// The scalar readers are further partitioned into
             :   /// - dict_filterable_readers_,
> As I wrote in the previous comments, I think that it would be clearer to pa
Thanks! I think partitioning readers into filter/non-filter is done in 
DivideFilterAndNonFilterColumnReaders(), which is moved before 
PartitionReaders() in this patch.


http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc@1093
PS6, Line 1093:   // Return OK if we have already tried reading the dictionary 
page.
              :   if (col_chunk_reader_.PageHeadersRead() > 0) {
              :     return Status::OK();
              :   }
> When can this happen?
I think this can happen when we need more than one scratch batch or row batch 
for the output rows because InitDictionary() will be called in each iteration 
of the while loop in AssembleRows().


http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc@1423
PS6, Line 1423:   if (col_chunk_reader_.PageHeadersRead() == 0) {
              :     def_level_ = ParquetLevel::ROW_GROUP_END;
              :     rep_level_ = ParquetLevel::ROW_GROUP_END;
              :     return true;
              :   }
> I am still trying to understand what the original code is doing here. AFAIK
Thanks! I agree that it is about error checking. I think it checks whether we 
have read all the pages by trying to read the next page.

When we skip IO completely, it sets rep_level_ so that RowGroupAtEnd() returns 
true.



--
To view, visit http://gerrit.cloudera.org:8080/23012
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d
Gerrit-Change-Number: 23012
Gerrit-PatchSet: 8
Gerrit-Owner: Xuebin Su <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Xuebin Su <[email protected]>
Gerrit-Comment-Date: Tue, 08 Jul 2025 09:45:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9874: Skip IO for late materialized columns

Reply via email to