Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 6: (7 comments) http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131 PS6, Line 131: for column_info in columns: The next block is a good candidate to move to a separate function. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@134 PS6, Line 134: previous_loc = column_info.offset_index.page_locations[0] : for current_loc in column_info.offset_index.page_locations[1:]: : assert previous_loc.offset < current_loc.offset : assert previous_loc.first_row_index < current_loc.first_row_index : previous_loc = current_loc I would move this to a separate function like "validate_page_locations". http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@139 PS6, Line 139: null_pages = column_info.column_index.null_pages The next empty line could be moved before this line to group it together with the related asserts. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@157 PS6, Line 157: if column_info.stats: This is more subjective than the comment in line 131, but I would move the next block to a separate function like "validate_index_stats_consistency". http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163 PS6, Line 163: for null_page in null_pages: : assert null_page This could be "assert False not in null_pages". Creating a variable to store whether all pages are null could be useful to check that if all pages are null, the min and max stats should not be set. http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@166 PS6, Line 166: return This "return" should be "continue", but it would be even better to move to implement the suggestion in line 131 and keep this as a "return". http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@169 PS6, Line 169: for null_page, value in zip(null_pages, min_values): : if not null_page: : current_value = decode_stats_value(column_info.schema, value) : assert current_value >= min_value : : max_value = decode_stats_value(column_info.schema, max_value_str) : for null_page, value in zip(null_pages, max_values): : if not null_page: : current_value = decode_stats_value(column_info.schema, value) : assert current_value <= max_value These two for loops could be merged. I would also prefer more exact variable names, like min/max_value_in_page instead of current_value, min/max_value in column_chunk instead of min/max_value. Another test ideas: - the sum of page null counts should be equal to the total null count - in most cases the minimum/maximum among the page min/max values should be equal to the column chunk's min/max value. As all the tests are checking tables created by Impala, we can be stricter/more specific than the Praquet spec. Special columns where this is not true could be listed, or their expected values could be set in a map to check the logic behind these values. -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 6 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Anonymous Coward #248 Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 27 Mar 2018 16:15:51 +0000 Gerrit-HasComments: Yes
