Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 2: (12 comments) Thanks for the comments! I fixed what I could, but also have some further questions. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector<bool> null_pages_; > The spec for page indexes actually allows to store min/max values that do n Tim: StringMinMaxFilter::MaterializeValues(): thanks for the tip, I'll use something similar. Lars/Tim: So, the cap should be 4kb? About the case when the string starts with 4kb of 0xFF: It seems to me that the behavior in this case should be decided by the Parquet community. Anyway, what if we don't set the max_value in this case? Looking at HdfsParquetScanner::EvaluateStatsConjuncts() that should do the job. However, now the ordering of pages gets complicated. We need "no max_value and not null pages" at the end for ascending order. Null pages also don't have max_value, but they can be anywhere in the list of pages. However, based on the comment in https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L797 , we cannot do that with the current specification. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739 PS2, Line 739: null_pages_ > See my comment above, this should only be true for null pages. I'm not sure if I've found the relevant comment. There is a comment that says the spec allows to set min and max values that are not present in the column. But it is only for compaction, right? If there are no values in the column, does it make sense to set min and max values? Yeah, the spec seems to allow that, but if Impala's implementation doesn't populate min and max values only for null pages, maybe we can use this information, can't we? Or, I can of course smarten up the parquet ColumnStats class. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1199 PS2, Line 1199: uint8_t* buffer; > declare in the most nested scope possible. Done http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1207 PS2, Line 1207: column_index.__set_null_pages(column.null_pages_); > Can we move() these to avoid a copy? Yes, done. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: < > Ideally the comparators would be implemented only once and we would call th I just realized that the min/max values are only strings at this point, at it is kind of cumbersome to retrieve the type information here. At L746 we call Merge() on row_group_stats_base_ to merge the page statistics into the column chunk statistics. Maybe Merge() could keep track of the ordering of the merged pages and then we could just call "GetBoundaryOrder()" on row_group_stats_base_. What do you think? http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1254 PS2, Line 1254: > nit: extra blank line Done http://gerrit.cloudera.org:8080/#/c/9693/2/common/thrift/parquet.thrift File common/thrift/parquet.thrift: PS2: > Is this file now identical with the upstream parquet.thrift? No, I only added the changes of PARQUET-922. diff says there's quite a lot differences between the two files. http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@683 PS2, Line 683: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite): > I think this should live in its own file. It also needs a class comment. Done http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@734 PS2, Line 734: tmp_dir = make_tmp_dir() > This should use the tmpdir fixture from pytest (see other tests in this fil Done http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@746 PS2, Line 746: def _validate_ordering(ordering, schema, null_pages, values): > This should have a function comment. Added a comment. http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@785 PS2, Line 785: # Not accurate. Does not make sure null_count is equal to number of rows. > I think this should eventually be accurate (but I could be missing somethin Now I'm using num_values from PageHeader. http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@835 PS2, Line 835: def test_write_index_alltypes(self, vector, unique_database): > Let's make sure that we cover all data types (the name of alltypes is unfor Added tests for tables decimal_tbl and chars_formats. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 20 Mar 2018 19:34:03 +0000 Gerrit-HasComments: Yes
