Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 3: (19 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector<bool> null_pages_; > Tim: I think that null_pages should also have covered non-null pages with invalid stats, but the current spec doesn't seem to allow that. In light of that I see two options: 1) Store more than 4KB of the pathological value until a non-0xFF character. I think we don't allow unlimited Parquet footer sizes, but I cannot find the code that enforces this right now. 2) Not store a ColumnIndex for that column at all. I am in favor of 2). This case seems highly hypothetical to me and we really just need to deal with it in a standard-conforming way. I don't think we need to care about the actual performance in this particular scenario. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739 PS2, Line 739: null_pages_ > I'm not sure if I've found the relevant comment. Sry for the confusion. The spec says that this must be set if and only if all values are null. Can you add a DCHECK to make sure that the null_count is equal to the number of values in the page? http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: n > I just realized that the min/max values are only strings at this point, at That sounds good to me. http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc@1211 PS3, Line 1211: for (int j = 0; j < column.min_values_.size() - 1; ++j) { After moving column.min_values_ above it is invalid here. It also leaves the whole BaseColumnWriter in an inconsistent state and we risk to introduce errors in the future. To improve this we could wrap the moves in a sub-scope that starts with dereferencing columns_[i] and ends with the move. We should then reset the column in columns_ and leave the array with NULL elements. That will require to make it very obvious that this happens through the name and comment of this function. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@55 PS3, Line 55: # We only support flat schemas, the additional element is the root element. We should assert that the schema is flat, e.g. by comparing lengths of row_group.columns and schemas. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@63 PS3, Line 63: is not None You should not need "is not None" here and elsewhere. The parentheses also shouldn't be necessary. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@84 PS3, Line 84: row_group_index.append((schema, stats, offset_index, column_index, page_headers)) This looks complicated enough to warrant a namedtuple, maybe even a class. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@87 PS3, Line 87: get_row_group ...row_groups... (plural) http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@90 PS3, Line 90: containing row groups and filename this comment seems wrong http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@103 PS3, Line 103: nit: All comments should follow PEP8 class comment format. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@118 PS3, Line 118: def _validate_parquet_index(self, hdfs_path, sorting_column, tmpdir, skip_col_idxs = None): long line http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@122 PS3, Line 122: skip_col_idxs = skip_col_idxs or [] It seems that this method is only called once, and without skip_col_idxs. Remove the parameter? http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@147 PS3, Line 147: num_values = page_header.data_page_header_v2.num_values Impala currently does not write these, add an assertion http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@164 PS3, Line 164: if max_value is not None: What happens if it is None? http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@165 PS3, Line 165: max_value = decode_stats_value(schema, max_value) Don't overwrite the variable http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@205 PS3, Line 205: alltypes copy-paste error? http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@230 PS3, Line 230: We also should have a test with a file that has 4K of 0xFF in it, just to make sure we handle the corner case correctly. Additionally, please test with very wide rows and with many columns. The functional schema has tables for this already: widerow widetable_1000_cols widetable_250_cols widetable_500_cols http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py File tests/util/get_parquet_index.py: PS3: Maybe we should merge this file with get_parquet_metadata.py (and clean it up while we're here). http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py@29 PS3, Line 29: thirft typo -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 3 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 20 Mar 2018 22:46:06 +0000 Gerrit-HasComments: Yes
