Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered 
out rows in Parquet table.
......................................................................


Patch Set 5:

(7 comments)

Looks good!

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330
PS3, Line 2330:
> >> By reading the code, my guess is that each batch covers a >> number of,
Okay. Thanks for the clarification on skip length.

My guess is that converting to batches may not be beneficial, especially when 
the T and F are interleaved tightly (the common case). In this case, you may 
need to recheck the selected rows in the batch.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2381
PS3, Line 2381: if (micro_batches[0].start > 0) {
              :           if 
(UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) {
              :             return Status(Substitute("Couldn't skip rows in 
file $0.", filename()));
              :           }
              :         }
> It is possible to have micro_batches[0].start==0, but in that case we don't
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2417
PS3, Line 2417: return Status::OK();
> This behaviour is retained from earlier code. You will find the code in Ass
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@553
PS3, Line 553:
> sure. Not sure if this happened due to clang-format.
Normally, I just highlight a section of code modified (in vi: shift-v and 
scroll down with j) and clang format (in vi: control-k) :-).


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1151
PS3, Line 1151: Status::OK();
> Not really. It depends on if abort_on_error is set as Query option. LogCorr
Good to know! Done.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1484
PS3, Line 1484: num_buffered_values_ == 0
> This signifies end of RowGroup. Same logic is currently being used (check N
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1538
PS3, Line 1538: }
> We don't need to check, we can just return it and client will check it. But
Hmm. It seems a false returning status means likely the data in the page is 
corrupted. Sounds like we should bail out? The return status from 
SkipTopLevelRows() in the code below is checked.

1218 bool BaseScalarColumnReader::SkipTopLevelRows(int64_t num_rows) {
1219   DCHECK_GE(num_buffered_values_, num_rows);                               
  
1220   DCHECK_GT(num_rows, 0);
.. .. ..
1272     }                                                                      
1273   }                                                  
1274   return SkipEncodedValuesInPage(num_values_to_skip);
1275 }

 385 template <typename InternalType, parquet::Type::type PARQUET_TYPE, bool 
MATERIALIZED>
 386 bool ScalarColumnReader<InternalType, PARQUET_TYPE,
 387     MATERIALIZED>::SkipEncodedValuesInPage(int64_t num_values) {
 388   if (bool_decoder_) {                                                     
       
 389     return bool_decoder_->SkipValues(num_values);           
 390   }                                   
 391   if (IsDictionaryEncoding(page_encoding_)) {                              
                
 392     return dict_decoder_.SkipValues(num_values);                   
 393   } else {                                   
 394     DCHECK_EQ(page_encoding_, Encoding::PLAIN);
 395     int64_t encoded_len = ParquetPlainEncoder::EncodedLen<PARQUET_TYPE>(  
 396         data_, data_end_, fixed_len_size_, num_values);
 397     if (encoded_len < 0) return false;
 398     data_ += encoded_len;                                                  
      
 399   }                                                        
 400   return true;                       
 401 }



--
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 5
Gerrit-Owner: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 07 Oct 2021 13:46:47 +0000
Gerrit-HasComments: Yes

Reply via email to