Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out 
rows in Parquet table.
......................................................................


Patch Set 10:

(10 comments)

Looks great!

On testing, I wonder if we can add a counter on # of rows (or amount of data) 
not surviving the materialization. This will be useful to safe guard the 
feature and demonstrate its usefulness.

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch-test.cc@67
PS10, Line 67: // set every 16th row as selected.
I wonder if we can add two more tests for the following situations.

1. Clustered: over 1024 values, 200 consecutive are true and the rest is false;
2. interleaved: over 1024 values, randomly set 10%, 30%, and 70% to true.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@29
PS10, Line 29: ScratchMicroBatch
May need a cstr to properly init these fields.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@171
PS10, Line 171:   /// Consecutive bits set are used to create ranges. Ranges 
that differ by less than
nit (or micro batches).


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@172
PS10, Line 172:  E.g., for ranges 1-8, 11-20, 35-100 derived
              :   /// from 'selected_rows' and 'skip_length' as 10, first two 
ranges would be merged
              :   /// into 1-20 as they differ by 3 (11 - 8) which is less than 
10 ('skip_length').
This is wonderful.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@176
PS10, Line 176: atleast
nit.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@178
PS10, Line 178: range
nit. batch_idx may be a better name in this method.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@203
PS10, Line 203: DCHECK(start != -1) << "Atleast one of the 
'scratch_batch_->selected_rows'"
              :                         << "should be true";
nit. An alternative is the following, which is more robust.

if (start == -1) {
 return 0;
}


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/service/query-options.h@50
PS10, Line 50: PARQUET_MATERIALIZATION_THRESHOLD
nit: PARQUET_LATE_MATERIALIZATION_THRESHOLD?


http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/ImpalaService.thrift@701
PS10, Line 701:   // of columns in parquet
nit. -1 to turn off the feature.


http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/Query.thrift@554
PS10, Line 554:   // of columns in parquet
nit. -1 to turn off the feature.



--
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 10
Gerrit-Owner: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 22 Oct 2021 14:43:30 +0000
Gerrit-HasComments: Yes

Reply via email to