Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector<bool> null_pages_;
I'm concerned that this may introduce a non-trivial amount of untracked memory, 
particularly for non-clustered dynamically-partitioned inserts where we have 
multiple table writers going at the same time. I don't think we cap the size of 
min/max values, so this is an additional concern.

StringMinMaxFilter::MaterializeValues() has some relevant logic to cap the size 
of strings to 1KB, which we should probably use.

I did a rough calculation for a 512MB parquet file. That could have 8192 64KB 
pages. Even with BIGINT values, that would be 128kb of stats, which is an 
amount we should be tracking. With 1kb strings that would be 8MB of untracked 
min/max values.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1207
PS2, Line 1207:     column_index.__set_null_pages(column.null_pages_);
Can we move() these to avoid a copy?


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: <
I think we should avoid using builtin operators to compare values since the 
Parquet spec's order won't necessarily always match the C++ types order.

Having a helper function would make it clear that the code means the Parquet 
spec's order.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1254
PS2, Line 1254:
nit: extra blank line


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@835
PS2, Line 835:   def test_write_index_alltypes(self, vector, unique_database):
Let's make sure that we cover all data types (the name of alltypes is 
unfortunate, since it's missing DECIMAL, CHAR and VARCHAR).



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Fri, 16 Mar 2018 22:03:02 +0000
Gerrit-HasComments: Yes

Reply via email to