Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 11:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@533
PS10, Line 533: if (fn_name == "lt" || fn_name == "le") {
             :   // We need to get min stats.
             :   stats_field = ColumnStatsReader::StatsField::MIN;
             : } else if (fn_name == "gt" || fn_name == "ge") {
             :   // We need to get max stats.
             :   stats_field = ColumnStatsReader::StatsField::MAX;
             : } else {
             :   DCHECK(false) << "Unsupported function name for statistics evaluation: "
             :                 << fn_name;
             :   continue;
             : }
This could also be replaced with GetRequiredStatsField().


http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@1186
PS10, Line 1186: I
type: It


http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@1188
PS10, Line 1188: eader->max_rep_level() == 0
Can't this happen when 'reader' also reads a nested column but doesn't have a parent collection reader, as described in the comment's last paragraph, and reference_reader reads a collection with deeper nesting?


http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@1193
PS10, Line 1193:
See line 1188.


http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc
File be/src/exec/parquet/parquet-page-index.cc:

http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc@81
PS11, Line 81: page_index_buffer_.TryAllocate(size)
I thought a bit about memory management. This is the only place where we allocate, but we never call Release(); that only happens when the parent scanner is deleted. I think that we should release the memory earlier, for 2 reasons:

1.
TryAllocate() expects the buffer to be empty, which will not be true if ParquetPageIndex is initialized again, e.g. when reading a new row group in either the current or a new file. I wonder why we didn't hit this error during tests.

2. Keeping the page index buffer in memory during scanning is unnecessary, as we only use it during HdfsParquetScanner::ProcessPageIndex().

A related thing to think about is memory reservation: does the memory requirement of Parquet scanning rise significantly when we use page indexes? computeMinScalarColumnMemReservation() and friends in https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java try to estimate the minimum memory needed by scanning. They are supposed to only overestimate, never underestimate. A safe approach would be to assume a maximum size for the page index, add it to the reservation, and skip the page index if it is bigger than that. Or, probably better, always try to allocate, treat allocation failure as non-fatal, and simply skip the page index in that case.

--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 11
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 12 Apr 2019 17:16:47 +0000
Gerrit-HasComments: Yes
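[Editor's note] To make the "treat allocation failure as non-fatal" suggestion in the last comment concrete, here is a minimal C++ sketch. FakeMemPool and TryReadPageIndex are hypothetical stand-ins for illustration only, not Impala's actual ScopedBuffer/MemTracker API; the point is the control flow of releasing the previous row group's buffer first and falling back to scanning without the page index when the allocation fails.

```cpp
#include <cstdint>

// Hypothetical pool with a fixed budget whose TryAllocate() fails
// non-fatally when the budget would be exceeded (illustrative names,
// not Impala's real memory-management classes).
class FakeMemPool {
 public:
  explicit FakeMemPool(int64_t budget) : budget_(budget) {}
  bool TryAllocate(int64_t size) {
    if (used_ + size > budget_) return false;  // non-fatal failure
    used_ += size;
    return true;
  }
  void Release() { used_ = 0; }
 private:
  int64_t budget_;
  int64_t used_ = 0;
};

// Sketch of the suggested flow. Returns true if the page index buffer
// was allocated and the index can be used; false means the caller
// should scan the row group without page filtering.
bool TryReadPageIndex(FakeMemPool* pool, int64_t page_index_size) {
  // Release any buffer left over from a previous row group, so the
  // allocation below starts from an empty buffer (addresses point 1).
  pool->Release();
  if (!pool->TryAllocate(page_index_size)) {
    // Allocation failure is non-fatal: skip the page index.
    return false;
  }
  // ... read and deserialize the page index here ...
  return true;
}
```

Usage: with a 1024-byte budget, a 512-byte index is read, a 2048-byte index is skipped, and the scan proceeds either way; no reservation needs to cover the worst-case index size up front.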
