[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Zoltan Borok-Nagy (Code Review) Mon, 15 Apr 2019 05:15:59 -0700

Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )


Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 11:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@533
PS10, Line 533: if (fn_name == "lt" || fn_name == "le") {
              :       // We need to get min stats.
              :       stats_field = ColumnStatsReader::StatsField::MIN;
              :     } else if (fn_name == "gt" || fn_name == "ge") {
              :       // We need to get max stats.
              :       stats_field = ColumnStatsReader::StatsField::MAX;
              :     } else {
              :       DCHECK(false) << "Unsupported function name for 
statistics evaluation: " << fn_name;
              :       continue;
              :     }
> This could be also replaced with GetRequiredStatsField().
Done


http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@1186
PS10, Line 1186: I
> type: It
Done


http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@1188
PS10, Line 1188: eader->max_rep_level() == 0
> Can't this happen when 'reader' also reads a nested column, but it doesn't
I don't think that's possible because once you need a repeated field 
"structurally" (i.e. using a collection reader at the top), you can't read 
another repeated field without reading its structure, i.e. it will also have a 
parent collection reader.


http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@1193
PS10, Line 1193:
> See line 1188.
Same answer.


http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc
File be/src/exec/parquet/parquet-page-index.cc:

http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc@81
PS11, Line 81: page_index_buffer_.TryAllocate(size)
> I thought a bit about memory management. This is the only place where we al
1. The ScanNode creates and deletes scanners for the file splits. This means a 
scanner is not re-used across files.
It is called from ProcessFooter(), i.e. once for a file, not once for row 
groups.

2. Note that we read the whole page index for all the row groups here and 
ProcessPageIndex() is called for each row group later.

Regarding to memory estimations:
OK, the suggestion makes sense. I added a WARNING log when we can't allocate 
the buffer.



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Apr 2019 12:15:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Reply via email to