Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files

Patch Set 10:


Overall this is looking good. I have a few specific concerns about the 
nitty-gritty details.

File be/src/exec/hdfs-parquet-table-writer.cc:

PS10, Line 301:   std::vector<std::string> min_values_;
I'm still concerned about the amount of untracked memory from min_values_ and 
max_values_, even if we truncate the string values to 1KB or similar - it seems 
like we could end up with multiple MB of untracked memory. We could probably 
live with it since it's smaller than the actual data, but it's a step in the 
wrong direction.

Maybe we could store min_values_ and max_values_ as StringValues backed by 
memory from per_file_mem_pool_ and then only convert to strings when writing 
out each column to the page index?

PS10, Line 735:     min_values_.push_back(std::string(""));
I don't know if we need the call to std::string() here. I think it should work 
if we just call emplace_back() to instantiate an empty string.
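
For reference, a minimal sketch of the two forms side by side. The function 
MakeEmptyEntries and its local vector are hypothetical stand-ins for the 
writer's min_values_ member, not code from the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the writer's min_values_ member.
std::vector<std::string> MakeEmptyEntries() {
  std::vector<std::string> min_values;
  // Form used in the patch: builds a temporary std::string, then moves it in.
  min_values.push_back(std::string(""));
  // Suggested form: default-constructs an empty std::string in place.
  min_values.emplace_back();
  // Both entries are empty strings, so the two forms are interchangeable here.
  assert(min_values[0] == min_values[1]);
  return min_values;
}
```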

PS10, Line 1227:   for (auto& column : columns_) {
nit: can fit loop on one line.

File be/src/exec/parquet-column-stats.h:

PS10, Line 159:   // If true, min/max values are ascending.
Maybe briefly mention why they both start off true, and that both can be true 
at the same time? It's slightly subtle.
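
To illustrate the subtlety, here is a rough sketch of how I'd expect such 
flags to behave. OrderingFlags and ClassifyOrdering are hypothetical names, and 
I'm assuming the flags track non-strict ordering of the per-page values; this 
is not the code in parquet-column-stats.h.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flags: both start true and are only ever cleared.
struct OrderingFlags {
  bool ascending = true;
  bool descending = true;
};

// Assumption: ordering is non-strict, so equal neighbours clear neither flag.
OrderingFlags ClassifyOrdering(const std::vector<int>& page_values) {
  OrderingFlags flags;
  for (std::size_t i = 1; i < page_values.size(); ++i) {
    if (page_values[i] < page_values[i - 1]) flags.ascending = false;
    if (page_values[i] > page_values[i - 1]) flags.descending = false;
  }
  return flags;
}
```

With zero or one page, or when every page has the same value, nothing ever 
clears a flag, so both stay true. That is the case the comment could call out.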

File tests/query_test/test_parquet_page_index.py:

PS10, Line 37: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
We've got a lot of good coverage in this test.

I'm wondering if we're missing some basic tests that confirm that the values in 
the page match the min/max values in the page index. It seems like these 
validations might not catch some kinds of bugs. E.g. min/max values in the 
index are somehow out-of-sync with the pages. Most bugs that I can imagine 
would get caught by one validation or another but it would be nice to have a 
sanity test where we confirm that the values in each page match the values in 
the page index.
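
The check I have in mind is roughly the following, sketched in C++ over an 
in-memory model. PageIndexEntry and PagesMatchIndex are hypothetical; the real 
test would have to decode the pages and the page index from the written file, 
and would need to decide how to treat empty or all-null pages.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical model of one page's entry in the page index.
struct PageIndexEntry {
  int min_value;
  int max_value;
};

// Returns true only if every page's actual min/max equals its index entry.
bool PagesMatchIndex(const std::vector<std::vector<int>>& pages,
                     const std::vector<PageIndexEntry>& index) {
  if (pages.size() != index.size()) return false;
  for (std::size_t i = 0; i < pages.size(); ++i) {
    // Simplification: this sketch treats an empty page as a mismatch.
    if (pages[i].empty()) return false;
    auto mm = std::minmax_element(pages[i].begin(), pages[i].end());
    if (*mm.first != index[i].min_value) return false;
    if (*mm.second != index[i].max_value) return false;
  }
  return true;
}
```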

PS10, Line 177: previouse_value
typo in variable name

PS10, Line 205: falied
nit: failed

PS10, Line 205: column_info_schema
this variable isn't defined - did you mean column_info?

PS10, Line 244: chars_formats
chars_formats is weird in that it's created by a different test - 
TestCharsFormats. I.e. it's not present unless that test ran before this one. 
Maybe we should change it so that the table is loaded during normal data 

To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Apr 2018 00:23:08 +0000
Gerrit-HasComments: Yes
