Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 8: (2 comments) http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector<bool> null_pages_; > About leaving max_value unset: we cannot really do that, because 'max_value About the formula for prefix size: what would happen, if there is no common prefix? If num_values is small (like 64 in the plain encoded 1k length case), the result would be 0. About the untracked memory usage: the the min/max values could be collected by page_stats_base_, which knows the type of the column, so it wouldn't need to store fixed length data as string. When a data page is over, it could write its min/max value to a buffer. The min/max values would still had to be converted to string for thrift serialization, but columns could do this one-by-one, so their peak heap usage wouldn't be added together. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: if not null_page: : page_max_value = decod > I added the column name to this specific error message. I was thinking about catching the exception in _validate_parquet_page_index, adding the column name to its message or printing it, and then re-throwing the AssertionError. The same could be done for page index by using an iterator that is held in a member / argument, so its state is still intact in the caller function when the exception is caught, and using it in every loop to iterate over pages. The iterator stuff does probably not worth the effort, but knowing the name of the offending column could be really helpful if the test catches an error. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Thu, 29 Mar 2018 19:44:57 +0000 Gerrit-HasComments: Yes