Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 17: (19 comments) http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/exec/hdfs-parquet-table-writer.cc@769 PS16, Line 769: if (UNLIKELY(!table_sink_mem_tracker_->TryConsume(new_memory_allocation))) { > Need to increment this after the TryConsume() succeeds, otherwise we'll rel Ouch, right. Done. http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/9693/15/be/src/util/string-util-test.cc@68 PS15, Line 68: EvalTruncation(normal_plus_max, "abcdeg", 10, UP); > Thanks for adding more tests. I meant the exact 3-character string consisti I had a similar test, when I used the '127' literal. I updated it to use your 3-character string. http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/9693/16/be/src/util/string-util.cc@46 PS16, Line 46: // We convert it to unsigned because signed overflow results in undefined behavior. > I think this could use a comment and/or an explicit cast to make the intent Added comment and cast. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@36 PS16, Line 36: > update Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@55 PS16, Line 55: > nit: PEP8 says the closing """ should be on a new line. You can use flake8 Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@90 PS16, Line 90: column_info = ColumnInfo(schema, stats, offset_index, column_index, page_headers) > I think you should be able to just pass all the parameters to the ctor "Col Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@96 PS16, Line 96: e_headers) > "ColumnInfo objects" or "column infos" "column infos", done. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@97 PS16, Line 97: > the first row group? Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@110 PS16, Line 110: sert previous_loc.offset < current > You can zip a list with itself to create (prev, cur) pairs: Nice! Thanks, done. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@124 PS16, Line 124: sert page > nit: page_is_null might be more clear Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@182 PS16, Line 182: else: > Would it make sense to validate the ordering instead of re-computing it? I. Yeah, I think it's better to use a different algorithm with the same semantics. Rewrote this function. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@214 PS16, Line 214: > unused? Removed it. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@216 PS16, Line 216: and that th > Stale comment? Yes, removed it. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@249 PS16, Line 249: else: > move closer to the comment explaining the intent Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@321 PS16, Line 321: tmp > nit: end Done http://gerrit.cloudera.org:8080/#/c/9693/16/tests/query_test/test_parquet_page_index.py@348 PS16, Line 348: : column = > I had missed this looking at the truncation code. Why do we keep the zeroes Yeah, I don't think we need those. Changed the logic. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@31 PS16, Line 31: "" : with open(file > That does not actually seem to be a requirement of the function. Should we Removed the last two sentences from the comment and renamed the method to 'read_range'. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@39 PS16, Line 39: contain a serialized thrift object.""" > Mention "serialized_object" in the comment, or rename the parameter to buff Renamed the parameter to 'serialized_object_buffer'. http://gerrit.cloudera.org:8080/#/c/9693/16/tests/util/get_parquet_metadata.py@171 PS16, Line 171: (serialized_thrift_obj > Can read_serialized_object also create the protocol and return it? You coul I like the idea since we won't duplicate code that much. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 17 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Mon, 14 May 2018 16:58:38 +0000 Gerrit-HasComments: Yes