[Impala-ASF-CR] IMPALA-9874: Skip IO for late materialized columns

Csaba Ringhofer (Code Review) Wed, 18 Jun 2025 07:16:22 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23012 )


Change subject: IMPALA-9874: Skip IO for late materialized columns
......................................................................


Patch Set 6:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/23012/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23012/6//COMMIT_MSG@7
PS6, Line 7: IMPALA-9874: Skip IO for late materialized columns
           :
           : This patch implements IO skipping at column chunk level for Parquet
The scope is not clear - did we skip IO for skipped non-filter data pages even 
before this patch, and the patch only affects dictionary reading + read ahead 
of data pages? Or all pages were read before that patch? Or just the headers? 
If I remember correctly, only headers were read, with the exception of 
collection types, where decompressing the data page is need for rep/def levels 
in case of v1 headers.


http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/hdfs-parquet-scanner.h@484
PS6, Line 484: non_dict_filterable_non_filter_readers_
The name looks redundant to me, as non-filter reader is never dict_filterable. 
Could we have one vector for each reader set in the order of starting scanning 
pages?
1. dict_filterable_readers (started before other filter readers to be able to 
skip those in case the dictionary has no match)
2. non_dict_filterable_filter_readers
3. non_filter_readers


http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/hdfs-parquet-scanner.h@878
PS6, Line 878:   /// dictionary filtering is enabled and the reader can be 
dictionary filtered.
             :   /// non_dict_filterable_readers_ are further partitioned into
             :   /// non_dict_filterable_filter_readers_ and 
non_dict_filterable_non_filter_readers_
             :   /// depending on whehter the reader belongs to a filter reader 
or a non-filter reader.
As I wrote in the previous comments, I think that it would be clearer to 
partition into filter/non-filter readers and then further partition filter 
readers to dict filterable/non dict filterable


http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc@1093
PS6, Line 1093:   // Return OK if we have already tried reading the dictionary 
page.
              :   if (col_chunk_reader_.PageHeadersRead() > 0) {
              :     return Status::OK();
              :   }
When can this happen?


http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc@1423
PS6, Line 1423:   if (col_chunk_reader_.PageHeadersRead() == 0) {
              :     def_level_ = ParquetLevel::ROW_GROUP_END;
              :     rep_level_ = ParquetLevel::ROW_GROUP_END;
              :     return true;
              :   }
I am still trying to understand what the original code is doing here. AFAIK 
SetRowGroupAtEnd is not really needed to read the files, but is about error 
checking - we read through all columns and expect them to end at the same row 
num.
If we skip IO completely, this type of checking is not possible. Maybe we could 
remove this function completely, or just set it to some finished state without 
progressing with columns under any circumstances?



--
To view, visit http://gerrit.cloudera.org:8080/23012
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d
Gerrit-Change-Number: 23012
Gerrit-PatchSet: 6
Gerrit-Owner: Xuebin Su <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Xuebin Su <[email protected]>
Gerrit-Comment-Date: Wed, 18 Jun 2025 14:14:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9874: Skip IO for late materialized columns

Reply via email to