Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9693 )
Change subject: IMPALA-5842: Write page index in Parquet files ...................................................................... Patch Set 4: (17 comments) I have added a lot of nits, feel free to skip them for now if you think that it is too early to deal with such things at the moment. http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300 PS2, Line 300: std::vector<bool> null_pages_; > OK, I'll implement this part after that. About long strings: I think that it would be safer to start with a smaller max size than 1KB, for example 64, to avoid making inserts slower / the written files larger. Truncating long string can only make performance worse if a query contains =/</> comparison with a string longer than the truncated length, which seems a non-typical use case to me. http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@159 PS4, Line 159: // If true, min/max values are ascending nit: missing . http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@162 PS4, Line 162: // if true min/max values are descending nit: missing . http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h@207 PS4, Line 207: if (prev_page_min_buffer_.IsEmpty()) RETURN_IF_ERROR(CopyToBuffer(&prev_page_min_buffer_, &prev_page_min_value_)); : if (prev_page_max_buffer_.IsEmpty()) RETURN_IF_ERROR(CopyToBuffer(&prev_page_max_buffer_, &prev_page_max_value_)); nit: long lines http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@71 PS4, Line 71: column_index_length) nit: not enough indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@82 PS4, Line 82: offset_index_length) nit: not enough indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@85 PS4, Line 85: page_header = get_page_header(parquet_file, page_loc.offset, : page_loc.compressed_page_size) : page_headers.append(page_header) nit: too much indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@113 PS4, Line 113: print schema Are these prints intentional? http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@118 PS4, Line 118: print current_value Is this print intentional? http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@120 PS4, Line 120: assert current_value >= previous_value nit: too much indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@122 PS4, Line 122: assert current_value <= previous_value nit: too much indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@164 PS4, Line 164: # If either is None, then both need to be None. : assert min_value_str is None and max_value_str is None : for null_page in null_pages: : assert null_page : # Everything is None, no further checks needed. : return nit: too much indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@200 PS4, Line 200: qualified_table_name, source_table) nit: not enough indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@203 PS4, Line 203: ).format(qualified_table_name, sorting_column, source_table) nit: not enough indentation http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@244 PS4, Line 244: """Test table with wide row""" nit: missing . http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@249 PS4, Line 249: """Test tables with wide rows and many columns""" nit: missing . http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py File tests/util/get_parquet_metadata.py: http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py@169 PS4, Line 169: def get_column_index(filename, file_pos, length): : """Reads a ColumnIndex object from a file at the given position.""" : serialized_column_index = read_serialized_object(filename, file_pos, length) : protocol = create_protocol(serialized_column_index) : column_index = ColumnIndex() : column_index.read(protocol) : return column_index : : : def get_page_header(filename, file_pos, length): : """Reads a PageHeader object from a file at the given position.""" : serialized_page_header = read_serialized_object(filename, file_pos, length) : protocol = create_protocol(serialized_page_header) : page_header = PageHeader() : page_header.read(protocol) : return page_header : : : def get_offset_index(filename, file_pos, length): : """Reads an OffsetIndex object from a file at the given position.""" : serialized_offset_index = read_serialized_object(filename, file_pos, length) : protocol = create_protocol(serialized_offset_index) : offset_index = OffsetIndex() : offset_index.read(protocol) : return offset_index nit: too much indentation -- To view, visit http://gerrit.cloudera.org:8080/9693 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Gerrit-Change-Number: 9693 Gerrit-PatchSet: 4 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 26 Mar 2018 14:22:28 +0000 Gerrit-HasComments: Yes
