Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 17: (8 comments) http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@464 PS17, Line 464: bool GetRequiredStatsField(const string& fn_name, > I think this could go into parquet-column-stats.{h,cc}. Done http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@546 PS17, Line 546: ColumnStatsReader::StatsField stats_field = ColumnStatsReader::StatsField::MIN; > I think we usually don't initialize output parameters to make it clear that Hmm, I thought clang-tidy didn't like that but it must have been something else because it doesn't complain now. http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@643 PS17, Line 643: // We don't need the raw page index buffers anymore. : page_index_.Release(); > Can this go to ProcessPageIndex? It already touches a bunch of other state. ProcessPageIndex() has some RETURN_IF_ERROR macros and I wanted to be sure about calling Release(). However, I realised that I can just use a scope exit trigger in ProcessPageIndex(). http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-column-readers.h@400 PS17, Line 400: /// True, if we are using NextLevels() to readahead the next def and rep levels. In this > I feel that this field needs more explanation. From just looking at the com Elaborated the comment. http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-column-readers.h@403 PS17, Line 403: bool levels_readahead_ = false; > Would it simplify the code to make this levels_read_ahead_offset_ (being -1 Since currently there are only two possibilities (-1 or 0), having a simple flag is more exact I think. Also, we'd still need an if stmt when we have processed all the rows, because in this case we don't need to adjust the value of 'current_row_'. http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-page-index-test.cc File be/src/exec/parquet/parquet-page-index-test.cc: http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-page-index-test.cc@58 PS17, Line 58: void ValidatePageIndexRange(const RowGroupRanges& row_group_ranges, > Add a comment for this one, too? Done http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-page-index.h File be/src/exec/parquet/parquet-page-index.h: http://gerrit.cloudera.org:8080/#/c/12065/17/be/src/exec/parquet/parquet-page-index.h@56 PS17, Line 56: if there's at least parts of the page index are present > nit: grammar Rephrased the whole comment. http://gerrit.cloudera.org:8080/#/c/12065/17/tests/query_test/test_parquet_stats.py File tests/query_test/test_parquet_stats.py: http://gerrit.cloudera.org:8080/#/c/12065/17/tests/query_test/test_parquet_stats.py@87 PS17, Line 87: for batch_size in [0, 1]: > Should we use a proper test dimension for the batch size, e.g. like in test Yeah, I looked at that earlier, but I didn't want to run the other tests with those batch sizes, neither wanted to add if statements for each. And I use different batch sizes here and at L97 which would be a bit more complicated with the other approach. Also, it has the advantage to only load the data once. On the other hand, I agree that this is not the cleanest solution, so I can change it if you feel strong about it. -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 17 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Mon, 06 May 2019 14:05:35 +0000 Gerrit-HasComments: Yes