Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23012 )

Change subject: IMPALA-9874: Skip IO for late materialized columns
......................................................................


Patch Set 8:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/23012/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/23012/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@259
PS8, Line 259:   late_materialization_ = !(filter_readers_.empty() || 
non_filter_readers_.empty()
             :       || late_materialization_threshold_ < 0 || 
filter_readers_[0]->max_rep_level() > 0
             :       || HasStructColumnReader(non_filter_readers_));
optional: This complex condition + the logic in line 269-274 could have its own 
function, e.g. InitLateMaterialization()

Even if it stays here, some comment would be nice, e.g. 
filter_readers_[0]->max_rep_level() > 0 is not clear to me - when is this tre, 
why do we check only the first filter_reader?


http://gerrit.cloudera.org:8080/#/c/23012/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2464
PS8, Line 2464:     if (rows_selected_in_row_group_) {
              :       
RETURN_IF_ERROR(BaseScalarColumnReader::StartScans(non_filter_scalar_readers_));
              :       RETURN_IF_ERROR(
              :           
BaseScalarColumnReader::InitDictionaries(non_filter_scalar_readers_));
              :     }
I would prefer to have an extra bool (e.g. non_filter_scalar_readers_started_) 
to avoid calling these instead of handling double calling inside the functions


http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/23012/6/be/src/exec/parquet/parquet-column-readers.cc@1093
PS6, Line 1093:   // Return OK if we have already tried reading the dictionary 
page.
              :   if (col_chunk_reader_.PageHeadersRead() > 0) {
              :     return Status::OK();
              :   }
> I think this can happen when we need more than one scratch batch or row bat
I would prefer to handle this on caller side - left a comment in 
parquet-column-readers.cc about this


http://gerrit.cloudera.org:8080/#/c/23012/8/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/23012/8/be/src/exec/scratch-tuple-batch.h@139
PS8, Line 139: keep_aux_mem
if keep_aux_mem is false, then num_to_commit must be false, right? if yes, then 
there could be a DCHECK for this



--
To view, visit http://gerrit.cloudera.org:8080/23012
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4a052b028220517503e634e3f916d1fbd60eb65d
Gerrit-Change-Number: 23012
Gerrit-PatchSet: 8
Gerrit-Owner: Xuebin Su <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Xuebin Su <[email protected]>
Gerrit-Comment-Date: Thu, 07 Aug 2025 07:49:13 +0000
Gerrit-HasComments: Yes

Reply via email to