Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )
Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table. ...................................................................... Patch Set 5: (7 comments) Looks good! http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330 PS3, Line 2330: > >> By reading the code, my guess is that each batch covers a >> number of, Okay. Thanks for the clarification on skip length. My guess is that converting to batches may not be beneficial, especially when the T and F are interleaved tightly (the common case). In this case, you may need to recheck the selected rows in the batch. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2381 PS3, Line 2381: if (micro_batches[0].start > 0) { : if (UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) { : return Status(Substitute("Couldn't skip rows in file $0.", filename())); : } : } > It is possible to have micro_batches[0].start==0, but in that case we don't Done http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2417 PS3, Line 2417: return Status::OK(); > This behaviour is retained from earlier code. You will find the code in Ass Done http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@553 PS3, Line 553: > sure. Not sure if this happened due to clang-format. Normally, I just highlight a section of code modified (in vi: shift-v and scroll down with j) and clang format (in vi: control-k) :-). http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1151 PS3, Line 1151: Status::OK(); > Not really. It depends on if abort_on_error is set as Query option. LogCorr Good to know! Done. http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1484 PS3, Line 1484: num_buffered_values_ == 0 > This signifies end of RowGroup. Same logic is currently being used (check N Done http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1538 PS3, Line 1538: } > We don't need to check, we can just return it and client will check it. But Hmm. It seems a false returning status means likely the data in the page is corrupted. Sounds like we should bail out? The return status from SkipTopLevelRows() in the code below is checked. 1218 bool BaseScalarColumnReader::SkipTopLevelRows(int64_t num_rows) { 1219 DCHECK_GE(num_buffered_values_, num_rows); 1220 DCHECK_GT(num_rows, 0); .. .. .. 1272 } 1273 } 1274 return SkipEncodedValuesInPage(num_values_to_skip); 1275 } 385 template <typename InternalType, parquet::Type::type PARQUET_TYPE, bool MATERIALIZED> 386 bool ScalarColumnReader<InternalType, PARQUET_TYPE, 387 MATERIALIZED>::SkipEncodedValuesInPage(int64_t num_values) { 388 if (bool_decoder_) { 389 return bool_decoder_->SkipValues(num_values); 390 } 391 if (IsDictionaryEncoding(page_encoding_)) { 392 return dict_decoder_.SkipValues(num_values); 393 } else { 394 DCHECK_EQ(page_encoding_, Encoding::PLAIN); 395 int64_t encoded_len = ParquetPlainEncoder::EncodedLen<PARQUET_TYPE>( 396 data_, data_end_, fixed_len_size_, num_values); 397 if (encoded_len < 0) return false; 398 data_ += encoded_len; 399 } 400 return true; 401 } -- To view, visit http://gerrit.cloudera.org:8080/17860 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60 Gerrit-Change-Number: 17860 Gerrit-PatchSet: 5 Gerrit-Owner: Amogh Margoor <[email protected]> Gerrit-Reviewer: Amogh Margoor <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Kurt Deschler <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 07 Oct 2021 13:46:47 +0000 Gerrit-HasComments: Yes
