Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 11: (5 comments) http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@533 PS10, Line 533: if (fn_name == "lt" || fn_name == "le") { : // We need to get min stats. : stats_field = ColumnStatsReader::StatsField::MIN; : } else if (fn_name == "gt" || fn_name == "ge") { : // We need to get max stats. : stats_field = ColumnStatsReader::StatsField::MAX; : } else { : DCHECK(false) << "Unsupported function name for statistics evaluation: " << fn_name; : continue; : } > This could be also replaced with GetRequiredStatsField(). Done http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@1186 PS10, Line 1186: I > type: It Done http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@1188 PS10, Line 1188: eader->max_rep_level() == 0 > Can't this happen when 'reader' also reads a nested column, but it doesn't I don't think that's possible because once you need a repeated field "structurally" (i.e. using a collection reader at the top), you can't read another repeated field without reading its structure, i.e. it will also have a parent collection reader. http://gerrit.cloudera.org:8080/#/c/12065/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@1193 PS10, Line 1193: > See line 1188. Same answer. http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc File be/src/exec/parquet/parquet-page-index.cc: http://gerrit.cloudera.org:8080/#/c/12065/11/be/src/exec/parquet/parquet-page-index.cc@81 PS11, Line 81: page_index_buffer_.TryAllocate(size) > I thought a bit about memory management. This is the only place where we al 1. The ScanNode creates and deletes scanners for the file splits. This means a scanner is not re-used across files. It is called from ProcessFooter(), i.e. once for a file, not once for row groups. 2. Note that we read the whole page index for all the row groups here and ProcessPageIndex() is called for each row group later. Regarding to memory estimations: OK, the suggestion makes sense. I added a WARNING log when we can't allocate the buffer. -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 11 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Mon, 15 Apr 2019 12:15:30 +0000 Gerrit-HasComments: Yes