Amogh Margoor has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out 
rows in Parquet table.
......................................................................


Patch Set 9:

(24 comments)

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h@59
PS8, Line 59: rows
> nit. use of 'tuples' instead of 'row' here should be better as tuples are m
Done.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@66
PS8, Line 66:
> nit. Is it possible to DCHECK() it?
It is being checked in FilterScratchBatch. Moved comment to the same function 
too.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@110
PS8, Line 110: int 
HdfsColumnarScanner::ProcessScratchBatchCodegenOrInterpret(RowBatch* dst_batch) 
{
> nit: unintended new line?
Right. This function was changed in earlier patch with an extra argument. Hence 
formatting went for toss. Changed it now.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@562
PS8, Line 562: es
> nit. have.
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@935
PS8, Line 935:   /// Tuple memory to write to is specified by 
'scratch_batch->tuple_mem'.
> nit. Will be useful to describe non_filter_readers: All 'non_filter_readers
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@117
PS8, Line 117:  complete_micro_batch_.start = 0;
             :   complete_micro_batch_.end = scratch_batch_->capacity - 1;
             :   complete_micro_batch_.length = scratch_batch_->capacity;
> This probably should be moved to the cst.
what is cst ?


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@262
PS8, Line 262:   if (DoesNotSupportLateMaterialization(column_reader) || 
(slot_desc != nullptr
             :         && std::find(conjunct_slot_ids_.begin(), 
conjunct_slot_ids_.end(), slot_desc->id())
             :             != conjunct_slot_ids_.end())) {
> this loop can be written as
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@268
PS8, Line 268:
> Should we do the following prior to the for loop or DCHECK() they are empty
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2215
PS8, Line 2215: These c
> nit. Suggest to remove as it can cause confusion on the action taken in thi
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2218
PS8, Line 2218:  materializing tupl
> remove?
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2236
PS8, Line 2236: ptr);
> nit: "not needed" is better.
it is both actually. have updated comment.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2253
PS8, Line 2253: // 1. Filter rows only materializing the columns in 
'filter_readers_'
              :     // 2. Transfer the surviving rows
              :     // 3. Materialize rest of the columns only for surviving 
rows.
              :
              :     RETURN_IF_ERROR(FillScratchMicroBatches(filter_readers_, 
row_batch,
              :     
> nit Suggest to use the following pattern to enhance code readability.
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2282
PS8, Line 2282: omplete_mi
> Suggest to use int* = nullptr instead of int& for the function, so that we
It cannot be nullptr and will be modified by existing APIs like ReadValueBatch. 
but it has been changed to pointer though.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2291
PS8, Line 2291: if (*skip_row_group) { return Status::OK(); }
              :     }
> nit. same comment on code readability.
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2336
PS8, Line 2336: >S
> nit to
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2358
PS8, Line 2358:     bool continue_execution = false;
              :     int last = -1;
> nit. Same comment above.
We need the index here unlike the above place.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2363
PS8, Line 2363:     if (UNL
> UNLIKELY
In many cases it would be micro batch of size 1.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2364
PS8, Line 2364:      return Status(Substitute("Cou
> I wonder why we need this test.
First batch might start from 100th row, in which case we need to skip first 100 
rows.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@168
PS8, Line 168:   /// row after 'skip_row_id' and ignores 'num_rows'.
> nit. add "and ignores 'num_rows'".
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@509
PS8, Line 509: re not st
> Please add a comment about 'remaining'
Done.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1151
PS8, Line 1151: Status::OK();
> Should we return error instead OK here?
Nope. As commented earlier if error needs to be returned 
LogCorruptNumValuesInMetadataError will do it.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1156
PS8, Line 1156: RETURN_IF_ERROR(col_chunk_reader_.ReadNextDataPage(&eos, 
&data_, &data_size,
> nit. Can put the comment next to the bool argument for better readability.
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1282
PS8, Line 1282: *
> Output parameters should be pointers.
Done. changed it at other places too.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h@177
PS8, Line 177: int skip_length, ScratchMi
> nit: usually we put output parameters at the end of the param list
Done



--
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 9
Gerrit-Owner: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 21 Oct 2021 19:18:16 +0000
Gerrit-HasComments: Yes

Reply via email to