Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered 
out rows in Parquet table.
......................................................................


Patch Set 3:

(9 comments)

Was able to complete the review. Have some comments.

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h@60
PS3, Line 60: bool* selected_rows
Need to document the use of this new argument in the comment above.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@249
PS3, Line 249: vector<ParquetColumnReader*>&
nit. const vector<ParquetColumnReader*>&.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@250
PS3, Line 250: vector<ParquetColumnReader*>& filter_readers,
             :     vector<ParquetColumnReader*>& non_filter_readers)
In Impala BE, prefer to use vector<ParquetColumnReader*>* when the argument is 
to be modified.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@253
PS3, Line 253:   conjuncts.insert(std::end(conjuncts), 
std::begin(scan_node_->conjuncts()),
             :       std::end(scan_node_->conjuncts()));
             :   conjuncts.insert(std::end(conjuncts), 
std::begin(scan_node_->filter_exprs()),
Can you please check if the code is sufficient to deal with min/max filters 
from HJs?

For min/max filter F from HJ, we allocate two slots in minMaxTuple_ (see 
HdfsScanNode.java  line 843 to 846, and save the start slot index, together 
with the filter Id, in overlapPredicateDescs_.

Thus, we may need to iterate over BE version of  overlapPredicateDescs_ (see 
GetOverlapPredicateDescs() in parquet/hdfs-parquet-scanner.h) to grab the pair 
of the slots from which the column path can be found.

Is it possible to later materialize rows surviving a min/max filter?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@270
PS3, Line 270: vector<ScalarExpr*> conjuncts
should use const T&.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330
PS3, Line 2330: int skip_length)
By reading the code, my guess is that each batch covers a number of, up to skip 
length, selected rows.

When the selected rows interleave with non-selected rows (for example, T, T, F, 
T, F, F, T), then use of selected_rows as the representation should be better, 
since here in the code, the batch covers both rows T and F.

When the selected rows form a cluster (such as on a sorted column), then the 
current implementation selects on span of rows which is very nice.

Is it possible to work with selected_rows directly?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2377
PS3, Line 2377: continue_execution
should init this boolean, as the code below conditionally set it.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2381
PS3, Line 2381: if (micro_batches[0].start > 0) {
              :           if 
(UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) {
              :             return Status(Substitute("Couldn't skip rows in 
file $0.", filename()));
              :           }
              :         }
Do we need an else branch to deal with micro_batches[0].start==0? If it is not 
possible, a DCHECK() on it probably should be added.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2417
PS3, Line 2417: return Status::OK();
I wonder if a non Status::OK() object should return here.



--
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 04 Oct 2021 19:38:15 +0000
Gerrit-HasComments: Yes

Reply via email to