Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 6:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@131
PS6, Line 131:       for column_info in columns:
The next block is a good candidate to move to a separate function.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@134
PS6, Line 134:         previous_loc = column_info.offset_index.page_locations[0]
             :         for current_loc in 
column_info.offset_index.page_locations[1:]:
             :           assert previous_loc.offset < current_loc.offset
             :           assert previous_loc.first_row_index < 
current_loc.first_row_index
             :           previous_loc = current_loc
I would move this to a separate function like "validate_page_locations".


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@139
PS6, Line 139:         null_pages = column_info.column_index.null_pages
The next empty line could be moved before this line to group it together with 
the related asserts.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@157
PS6, Line 157:         if column_info.stats:
This is more subjective than the comment in line 131, but I would move the next 
block to a separate function like "validate_index_stats_consistency".


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163:             for null_page in null_pages:
             :               assert null_page
This could be "assert False not in null_pages". Creating a variable to store 
whether all pages are null could be useful to check that if all pages are null, 
the min and max stats should not be set.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@166
PS6, Line 166:             return
This "return" should be "continue", but it would be even better to move to 
implement the suggestion in line 131 and keep this as a "return".


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@169
PS6, Line 169:           for null_page, value in zip(null_pages, min_values):
             :             if not null_page:
             :               current_value = 
decode_stats_value(column_info.schema, value)
             :               assert current_value >= min_value
             :
             :           max_value = decode_stats_value(column_info.schema, 
max_value_str)
             :           for null_page, value in zip(null_pages, max_values):
             :             if not null_page:
             :               current_value = 
decode_stats_value(column_info.schema, value)
             :               assert current_value <= max_value
These two for loops could be merged. I would also prefer more exact variable 
names, like min/max_value_in_page instead of current_value, min/max_value in 
column_chunk instead of min/max_value.

Another test ideas:

- the sum of page null counts should be equal to the total null count

- in most cases the minimum/maximum among the page min/max values should be 
equal to the column chunk's min/max value. As all the tests are checking tables 
created by Impala, we can be stricter/more specific than the Praquet spec. 
Special columns where this is not true could be listed, or their expected 
values could be set in a map to check the logic behind these values.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 27 Mar 2018 16:15:51 +0000
Gerrit-HasComments: Yes

Reply via email to