[Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer

Tim Armstrong (Code Review) Mon, 23 Jan 2017 17:02:12 -0800

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................



Patch Set 6:

(13 comments)

Thanks , I think this will be easier to move forward.

I think to make progress on the other types we'll have to get buy-in in the 
upstream parquet project. I wonder if that may be tough for some data types - 
e.g. we'd ideally want to sort decimal by the logical value, but the min/max 
stats design doesn't really seem to be designed with logical types in mind.

http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS6, Line 136: <
<=?


PS6, Line 178: ProcessValue
Maybe AppendValue() or something like that? Process is a little vague.


Line 389:     bool ret = bool_values_->PutValue(v, 1);
Consider something like:

if (!bool_values_->PutValue(v, 1)) return false;


PS6, Line 644: <
<=?


PS6, Line 648: column
row group


http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 105:   static const int MAX_COLUMN_STATS_SIZE = 4 * 1024;
Mention that this value came from parquet-mr?


http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

PS6, Line 125: min
Others might have different opinions but I think I'd prefer making this even 
more explicit and avoid relying on Impala's built-in operator overloads, just 
because the semantics are subtle and it's easy for someone reading this code to 
gloss over the fact that we can't always use the "obvious" ordering. The 
comments do explain it, but people don't always read the comments.

Maybe we could have a set of functions like: ParquetStatsLessThan(int32_t, 
int32_t), etc. Might be worth getting someone else's opinion before making 
changes.


Line 139:     out->__set_min(min_str);
It probably doesn't really matter, but I think using move() here would a copy 
of min_str and max_str


PS6, Line 150: ((uint8_t*)
reinterpret_cast


Line 183: /// parquet-mr and subsequently Hive currently do not handle the 
following types
We should file an Impala JIRA to keep track of this too


http://gerrit.cloudera.org:8080/#/c/5611/6/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 208: class TestHdfsParquetTableStatsWriter(ImpalaTestSuite):
It would be good to have some kind of test for multiple row groups so that code 
is exercised.

Otherwise I can imagine it'd be easiest for a bug around resetting state 
between row groups to creep in.

Could we do some kind of validation like writing out the data in sorted order, 
then validating that the min-max ranges don't overlap?


Line 309:       (0, 7299),
It would be nice to test a table with a negative number to make sure that those 
are handled correctly.


http://gerrit.cloudera.org:8080/#/c/5611/6/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

Line 111:   return None
This return isn't needed


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3909: Populate min/max statistics in Parquet writer

Reply via email to