Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 2: (5 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector<bool> null_pages_; I'm concerned that this may introduce a non-trivial amount of untracked memory, particularly for non-clustered dynamically-partitioned inserts where we have multiple table writers going at the same time. I don't think we cap the size of min/max values, so this is an additional concern. StringMinMaxFilter::MaterializeValues() has some relevant logic to cap the size of strings to 1KB, which we should probably use. I did a rough calculation for a 512MB parquet file. That could have 8192 64KB pages. Even with BIGINT values, that would be 128kb of stats, which is an amount we should be tracking. With 1kb strings that would be 8MB of untracked min/max values. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1207 PS2, Line 1207: column_index.__set_null_pages(column.null_pages_); Can we move() these to avoid a copy? http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: < I think we should avoid using builtin operators to compare values since the Parquet spec's order won't necessarily always match the C++ types order. Having a helper function would make it clear that the code means the Parquet spec's order. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1254 PS2, Line 1254: nit: extra blank line http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@835 PS2, Line 835: def test_write_index_alltypes(self, vector, unique_database): Let's make sure that we cover all data types (the name of alltypes is unfortunate, since it's missing DECIMAL, CHAR and VARCHAR). -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Fri, 16 Mar 2018 22:03:02 +0000 Gerrit-HasComments: Yes
