Xuebin Su has posted comments on this change. ( http://gerrit.cloudera.org:8080/23012 )
Change subject: IMPALA-9874: Skip IO for late materialized columns ...................................................................... Patch Set 8: (5 comments) > Patch Set 6: > > (5 comments) Thanks for reviewing! http://gerrit.cloudera.org:8080/#/c/23012/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/23012/6//COMMIT_MSG@7 PS6, Line 7: IMPALA-9874: Skip IO for late materialized columns : : Previously, IO skipping only worked with statistics and dictionarie > The scope is not clear - did we skip IO for skipped non-filter data pages e Thanks! Added a description in the commit message. http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/hdfs-parquet-scanner.h@484 PS6, Line 484: s to 'non_dict_filterable_filter_reader > The name looks redundant to me, as non-filter reader is never dict_filterab Thanks! Changed. http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/hdfs-parquet-scanner.h@878 PS6, Line 878: /// are flattened into complex_readers_. All scalar readers are also flattened into : /// scalar_readers_. : /// The scalar readers are further partitioned into : /// - dict_filterable_readers_, > As I wrote in the previous comments, I think that it would be clearer to pa Thanks! I think partitioning readers into filter/non-filter is done in DivideFilterAndNonFilterColumnReaders(), which is moved before PartitionReaders() in this patch. http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc@1093 PS6, Line 1093: // Return OK if we have already tried reading the dictionary page. : if (col_chunk_reader_.PageHeadersRead() > 0) { : return Status::OK(); : } > When can this happen? I think this can happen when we need more than one scratch batch or row batch for the output rows because InitDictionary() will be called in each iteration of the while loop in AssembleRows(). http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc@1423 PS6, Line 1423: if (col_chunk_reader_.PageHeadersRead() == 0) { : def_level_ = ParquetLevel::ROW_GROUP_END; : rep_level_ = ParquetLevel::ROW_GROUP_END; : return true; : } > I am still trying to understand what the original code is doing here. AFAIK Thanks! I agree that it is about error checking. I think it checks whether we have read all the pages by trying to read the next page. When we skip IO completely, it sets rep_level_ so that RowGroupAtEnd() returns true. -- To view, visit http://gerrit.cloudera.org:8080/23012 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d Gerrit-Change-Number: 23012 Gerrit-PatchSet: 8 Gerrit-Owner: Xuebin Su <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Xuebin Su <[email protected]> Gerrit-Comment-Date: Tue, 08 Jul 2025 09:45:32 +0000 Gerrit-HasComments: Yes
