Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 2: (20 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector<bool> null_pages_; > Just saw Tim's comment. I think we discussed this a while ago in the Parque OK, I'll implement this part after that. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739 PS2, Line 739: null_pages_ > Sry for the confusion. The spec says that this must be set if and only if a Oh, sure. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195 PS2, Line 1195: // If the file contains more than one row_group, don't write the index. > I missed this earlier - this seems weird. Do we ever write more than one ro AFAIK we always write only one row group. I just copy-pasted this part from Pooja's work. Should I substitute it to a DCHECK maybe? http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224 PS2, Line 1224: < > That sounds good to me. Great, I implemented it that way. http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc@1211 PS3, Line 1211: // Let's see if there's an ordering on min/max values. > After moving column.min_values_ above it is invalid here. It also leaves th We are using columns_ in the next section to write the offset index, so can't reset them here, but at the end of the function. I only call BaseColumnWriter::Reset() on them, because unique_ptr::reset generates segfaults later. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@55 PS3, Line 55: > We should assert that the schema is flat, e.g. by comparing lengths of row_ Added an assert. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@63 PS3, Line 63: > You should not need "is not None" here and elsewhere. The parentheses also I removed it from couple of places. Kept them in asserts, because I think it is more readable that way. Also kept for "if max_value is not None" and such, because max_value can be set to 0. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@84 PS3, Line 84: > This looks complicated enough to warrant a namedtuple, maybe even a class. I chose namedtuple http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@87 PS3, Line 87: > ...row_groups... (plural) Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@90 PS3, Line 90: > this comment seems wrong Removed the last sentence. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@103 PS3, Line 103: > nit: All comments should follow PEP8 class comment format. Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@118 PS3, Line 118: > long line Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@122 PS3, Line 122: > It seems that this method is only called once, and without skip_col_idxs. R Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@147 PS3, Line 147: > Impala currently does not write these, add an assertion Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@164 PS3, Line 164: > What happens if it is None? Added some extra asserts about it. If either stats.min_value or stats.max_value is None, the other must be None as well. And all pages need to be null. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@165 PS3, Line 165: > Don't overwrite the variable Done http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@205 PS3, Line 205: > copy-paste error? yup, done. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@230 PS3, Line 230: > We also should have a test with a file that has 4K of 0xFF in it, just to m Do you know a quick way of generate a parquet file like that? I added the other tests. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py File tests/util/get_parquet_index.py: PS3: > Maybe we should merge this file with get_parquet_metadata.py (and clean it Moved the contents to get_parquet_metadata. http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py@29 PS3, Line 29: > typo Done -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 2 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 23 Mar 2018 22:27:10 +0000 Gerrit-HasComments: Yes
