Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )
Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table. ...................................................................... Patch Set 10: (10 comments) Looks great! On testing, I wonder if we can add a counter on # of rows (or amount of data) not surviving the materialization. This will be useful to safe guard the feature and demonstrate its usefulness. http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch-test.cc File be/src/exec/scratch-tuple-batch-test.cc: http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch-test.cc@67 PS10, Line 67: // set every 16th row as selected. I wonder if we can add two more tests for the following situations. 1. Clustered: over 1024 values, 200 consecutive are true and the rest is false; 2. interleaved: over 1024 values, randomly set 10%, 30%, and 70% to true. http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h File be/src/exec/scratch-tuple-batch.h: http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@29 PS10, Line 29: ScratchMicroBatch May need a cstr to properly init these fields. http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@171 PS10, Line 171: /// Consecutive bits set are used to create ranges. Ranges that differ by less than nit (or micro batches). http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@172 PS10, Line 172: E.g., for ranges 1-8, 11-20, 35-100 derived : /// from 'selected_rows' and 'skip_length' as 10, first two ranges would be merged : /// into 1-20 as they differ by 3 (11 - 8) which is less than 10 ('skip_length'). This is wonderful. http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@176 PS10, Line 176: atleast nit. http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@178 PS10, Line 178: range nit. batch_idx may be a better name in this method. http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@203 PS10, Line 203: DCHECK(start != -1) << "Atleast one of the 'scratch_batch_->selected_rows'" : << "should be true"; nit. An alternative is the following, which is more robust. if (start == -1) { return 0; } http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/service/query-options.h File be/src/service/query-options.h: http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/service/query-options.h@50 PS10, Line 50: PARQUET_MATERIALIZATION_THRESHOLD nit: PARQUET_LATE_MATERIALIZATION_THRESHOLD? http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/ImpalaService.thrift@701 PS10, Line 701: // of columns in parquet nit. -1 to turn off the feature. http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/Query.thrift File common/thrift/Query.thrift: http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/Query.thrift@554 PS10, Line 554: // of columns in parquet nit. -1 to turn off the feature. -- To view, visit http://gerrit.cloudera.org:8080/17860 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 Gerrit-Change-Number: 17860 Gerrit-PatchSet: 10 Gerrit-Owner: Amogh Margoor <[email protected]> Gerrit-Reviewer: Amogh Margoor <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 22 Oct 2021 14:43:30 +0000 Gerrit-HasComments: Yes
