Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 2:

(12 comments)

Thanks for the comments! I fixed what I could, but also have some further 
questions.

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector<bool> null_pages_;
> The spec for page indexes actually allows to store min/max values that do n
Tim:

StringMinMaxFilter::MaterializeValues(): thanks for the tip, I'll use something 
similar.

Lars/Tim:
So, the cap should be 4kb?


About the case when the string starts with 4kb of 0xFF:
It seems to me that the behavior in this case should be decided by the Parquet 
community.

Anyway, what if we don't set the max_value in this case? Looking at 
HdfsParquetScanner::EvaluateStatsConjuncts() that should do the job. However, 
now the ordering of pages gets complicated. We need "no max_value and not null 
pages" at the end for ascending order. Null pages also don't have max_value, 
but they can be anywhere in the list of pages.

However, based on the comment in 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L797
 , we cannot do that with the current specification.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739
PS2, Line 739: null_pages_
> See my comment above, this should only be true for null pages.
I'm not sure if I've found the relevant comment.

There is a comment that says the spec allows to set min and max values that are 
not present in the column. But it is only for compaction, right? If there are 
no values in the column, does it make sense to set min and max values?

Yeah, the spec seems to allow that, but if Impala's implementation doesn't 
populate min and max values only for null pages, maybe we can use this 
information, can't we?

Or, I can of course smarten up the parquet ColumnStats class.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1199
PS2, Line 1199:   uint8_t* buffer;
> declare in the most nested scope possible.
Done


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1207
PS2, Line 1207:     column_index.__set_null_pages(column.null_pages_);
> Can we move() these to avoid a copy?
Yes, done.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: <
> Ideally the comparators would be implemented only once and we would call th
I just realized that the min/max values are only strings at this point, at it 
is kind of cumbersome to retrieve the type information here.

At L746 we call Merge() on row_group_stats_base_ to merge the page statistics 
into the column chunk statistics. Maybe Merge() could keep track of the 
ordering of the merged pages and then we could just call "GetBoundaryOrder()" 
on row_group_stats_base_. What do you think?


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1254
PS2, Line 1254:
> nit: extra blank line
Done


http://gerrit.cloudera.org:8080/#/c/9693/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

PS2: 
> Is this file now identical with the upstream parquet.thrift?
No, I only added the changes of PARQUET-922.
diff says there's quite a lot differences between the two files.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@683
PS2, Line 683: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
> I think this should live in its own file. It also needs a class comment.
Done


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@734
PS2, Line 734:     tmp_dir = make_tmp_dir()
> This should use the tmpdir fixture from pytest (see other tests in this fil
Done


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@746
PS2, Line 746:   def _validate_ordering(ordering, schema, null_pages, values):
> This should have a function comment.
Added a comment.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@785
PS2, Line 785:           # Not accurate. Does not make sure null_count is equal 
to number of rows.
> I think this should eventually be accurate (but I could be missing somethin
Now I'm using num_values from PageHeader.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@835
PS2, Line 835:   def test_write_index_alltypes(self, vector, unique_database):
> Let's make sure that we cover all data types (the name of alltypes is unfor
Added tests for tables decimal_tbl and chars_formats.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 20 Mar 2018 19:34:03 +0000
Gerrit-HasComments: Yes

Reply via email to