Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 2:

(12 comments)

I, too, left some high level comments.

http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h@126
PS1, Line 126:   /// Write the column index and offset index of each page in 
the file.
Do we otherwise support writing more than one rowgroup?


http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h@127
PS1, Line 127:
... would only be written... ?


http://gerrit.cloudera.org:8080/#/c/9693/1/be/src/exec/hdfs-parquet-table-writer.h@127
PS1, Line 127:
typo


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector<bool> null_pages_;
> I'm concerned that this may introduce a non-trivial amount of untracked mem
The spec for page indexes actually allows to store min/max values that do not 
occur in the page: 
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L802

We should compute shorter prefixes when possible and cut them off at 4kb.

We also need to decide what to do if a string starts with 4kb of 0xFF.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739
PS2, Line 739: null_pages_
See my comment above, this should only be true for null pages.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1199
PS2, Line 1199:   uint8_t* buffer;
declare in the most nested scope possible.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: <
> I think we should avoid using builtin operators to compare values since the
Ideally the comparators would be implemented only once and we would call the 
same methods here and in the parquet stats.


http://gerrit.cloudera.org:8080/#/c/9693/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

PS2:
Is this file now identical with the upstream parquet.thrift?


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@683
PS2, Line 683: class TestHdfsParquetTableIndexWriter(ImpalaTestSuite):
I think this should live in its own file. It also needs a class comment.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@734
PS2, Line 734:     tmp_dir = make_tmp_dir()
This should use the tmpdir fixture from pytest (see other tests in this file)


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@746
PS2, Line 746:   def _validate_ordering(ordering, schema, null_pages, values):
This should have a function comment.


http://gerrit.cloudera.org:8080/#/c/9693/2/tests/query_test/test_insert_parquet.py@785
PS2, Line 785:           # Not accurate. Does not make sure null_count is equal 
to number of rows.
I think this should eventually be accurate (but I could be missing something).



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Fri, 16 Mar 2018 22:31:21 +0000
Gerrit-HasComments: Yes

Reply via email to