[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

Lars Volker (Code Review) Tue, 20 Mar 2018 15:46:39 -0700

Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )


Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 3:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector<bool> null_pages_;
> Tim:
I think that null_pages should also have covered non-null pages with invalid 
stats, but the current spec doesn't seem to allow that. In light of that I see 
two options:

1) Store more than 4KB of the pathological value until a non-0xFF character. I 
think we don't allow unlimited Parquet footer sizes, but I cannot find the code 
that enforces this right now.

2) Not store a ColumnIndex for that column at all.

I am in favor of 2). This case seems highly hypothetical to me and we really 
just need to deal with it in a standard-conforming way. I don't think we need 
to care about the actual performance in this particular scenario.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739
PS2, Line 739: null_pages_
> I'm not sure if I've found the relevant comment.
Sry for the confusion. The spec says that this must be set if and only if all 
values are null. Can you add a DCHECK to make sure that the null_count is equal 
to the number of values in the page?


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: n
> I just realized that the min/max values are only strings at this point, at
That sounds good to me.


http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc@1211
PS3, Line 1211:     for (int j = 0; j < column.min_values_.size() - 1; ++j) {
After moving column.min_values_ above it is invalid here. It also leaves the 
whole BaseColumnWriter in an inconsistent state and we risk to introduce errors 
in the future.

To improve this we could wrap the moves in a sub-scope that starts with 
dereferencing columns_[i] and ends with the move. We should then reset the 
column in columns_ and leave the array with NULL elements. That will require to 
make it very obvious that this happens through the name and comment of this 
function.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@55
PS3, Line 55:     # We only support flat schemas, the additional element is the 
root element.
We should assert that the schema is flat, e.g. by comparing lengths of 
row_group.columns and schemas.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@63
PS3, Line 63:  is not None
You should not need "is not None" here and elsewhere. The parentheses also 
shouldn't be necessary.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@84
PS3, Line 84:       row_group_index.append((schema, stats, offset_index, 
column_index, page_headers))
This looks complicated enough to warrant a namedtuple, maybe even a class.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@87
PS3, Line 87: get_row_group
...row_groups... (plural)


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@90
PS3, Line 90: containing row groups and filename
this comment seems wrong


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@103
PS3, Line 103:
nit: All comments should follow PEP8 class comment format.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@118
PS3, Line 118:   def _validate_parquet_index(self, hdfs_path, sorting_column, 
tmpdir, skip_col_idxs = None):
long line


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@122
PS3, Line 122:     skip_col_idxs = skip_col_idxs or []
It seems that this method is only called once, and without skip_col_idxs. 
Remove the parameter?


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@147
PS3, Line 147:             num_values = 
page_header.data_page_header_v2.num_values
Impala currently does not write these, add an assertion


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@164
PS3, Line 164:         if max_value is not None:
What happens if it is None?


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@165
PS3, Line 165:           max_value = decode_stats_value(schema, max_value)
Don't overwrite the variable


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@205
PS3, Line 205: alltypes
copy-paste error?


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@230
PS3, Line 230:
We also should have a test with a file that has 4K of 0xFF in it, just to make 
sure we handle the corner case correctly. Additionally, please test with very 
wide rows and with many columns. The functional schema has tables for this 
already:

  widerow
  widetable_1000_cols
  widetable_250_cols
  widetable_500_cols


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py
File tests/util/get_parquet_index.py:

PS3:
Maybe we should merge this file with get_parquet_metadata.py (and clean it up 
while we're here).


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py@29
PS3, Line 29: thirft
typo



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 20 Mar 2018 22:46:06 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5842: Write page index in Parquet files

Reply via email to