[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Csaba Ringhofer (Code Review) Tue, 16 Apr 2019 07:01:45 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )


Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 12: Code-Review+1

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/hdfs-parquet-scanner.cc@618
PS11, Line 618:       // If the row group overlaps with the
optional: This loop grew really large, > 100 lines, doesn't even fit to my huge 
monitor. :)

At this point it was decided that the row group belongs to the scanner. The 
rest could be extracted to functions like

Status ApplyStatFilters(const parquet::RowGroup& row_group, bool* 
row_group_skipped)

and

Status ScanDictioneries(const parquet::RowGroup& row_group, bool* 
row_group_skipped)


http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h
File be/src/exec/parquet/parquet-page-index.h:

http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h@39
PS11, Line 39:   /// It reads the raw bytes of the whole page index and stores 
it in an
             :   /// internal buffer.
             :   /// It doesn't expect that the Page index in a particular 
layout, it only
             :   /// expects that the whole page index layed out continuously 
in the file.
             :   /// It needs to be called before the serialization methods.
             :   Status ReadAll();
Can you mention that this reads the page index for all row groups? For me this 
was not intuitive. The function's name could also reflect this, as "ReadAll" 
could also mean reading it for all column chunks in the row group.


http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h@61
PS11, Line 61:   /// Common helper for deserialization
nit: other member comments have . at the end


http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.h@65
PS11, Line 65:   /// The scanner that created this object
nit: other member comments have . at the end


http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc
File be/src/exec/parquet/parquet-page-index.cc:

http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc@81
PS11, Line 81:
> 1. The ScanNode creates and deletes scanners for the file splits. This mean
Sorry for the wrong assumptions, I got lost somewhere in 
HdfsParquetScanner::NextRowGroup().

I still think though that we hold onto memory that we do not want to use later. 
Also, theoretically there can be huge Parquet files with lot of row groups, so 
the complete page index can be arbitrarily large.

Would it be problematic to read the page index for every row group separately 
during ProcessPageIndex()? This would mean that if the row group is skipped 
because of row group level stats or because of it is out of the split (the 
latter is the normal case for multi row group files I guess), then we would 
avoid reading it's page index.

Regardless, for the single row group case, releasing the buffer if we are 
reading the last row group would be enough. This would avoid the memory wasting 
for Parquet files written by Impala.

I am ok with dealing with this in another patch.



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 12
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Apr 2019 14:01:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Reply via email to