Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 4:

(17 comments)

I have added a lot of nits, feel free to skip them for now if you think that it 
is too early to deal with such things at the moment.

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector<bool> null_pages_;
> OK, I'll implement this part after that.
About long strings: I think that it would be safer to start with a smaller max 
size than 1KB, for example 64, to avoid making inserts slower / the written 
files larger. Truncating long string can only make performance worse if a query 
contains =/</> comparison with a string longer than the truncated length, which 
seems a non-typical use case to me.


http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@159
PS4, Line 159:   // If true, min/max values are ascending
nit: missing .


http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.h@162
PS4, Line 162:   // if true min/max values are descending
nit: missing .


http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/9693/4/be/src/exec/parquet-column-stats.inline.h@207
PS4, Line 207:   if (prev_page_min_buffer_.IsEmpty()) 
RETURN_IF_ERROR(CopyToBuffer(&prev_page_min_buffer_, &prev_page_min_value_));
             :   if (prev_page_max_buffer_.IsEmpty()) 
RETURN_IF_ERROR(CopyToBuffer(&prev_page_max_buffer_, &prev_page_max_value_));
nit: long lines


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@71
PS4, Line 71:           column_index_length)
nit: not enough indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@82
PS4, Line 82:         offset_index_length)
nit: not enough indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@85
PS4, Line 85:             page_header = get_page_header(parquet_file, 
page_loc.offset,
            :                     page_loc.compressed_page_size)
            :             page_headers.append(page_header)
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@113
PS4, Line 113:     print schema
Are these prints intentional?


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@118
PS4, Line 118:         print current_value
Is this print intentional?


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@120
PS4, Line 120:             assert current_value >= previous_value
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@122
PS4, Line 122:             assert current_value <= previous_value
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@164
PS4, Line 164:               # If either is None, then both need to be None.
             :               assert min_value_str is None and max_value_str is 
None
             :               for null_page in null_pages:
             :                   assert null_page
             :               # Everything is None, no further checks needed.
             :               return
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@200
PS4, Line 200:         qualified_table_name, source_table)
nit: not enough indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@203
PS4, Line 203:         ).format(qualified_table_name, sorting_column, 
source_table)
nit: not enough indentation


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@244
PS4, Line 244:     """Test table with wide row"""
nit: missing .


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/query_test/test_parquet_page_index.py@249
PS4, Line 249:     """Test tables with wide rows and many columns"""
nit: missing .


http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

http://gerrit.cloudera.org:8080/#/c/9693/4/tests/util/get_parquet_metadata.py@169
PS4, Line 169: def get_column_index(filename, file_pos, length):
             :     """Reads a ColumnIndex object from a file at the given 
position."""
             :     serialized_column_index = read_serialized_object(filename, 
file_pos, length)
             :     protocol = create_protocol(serialized_column_index)
             :     column_index = ColumnIndex()
             :     column_index.read(protocol)
             :     return column_index
             :
             :
             : def get_page_header(filename, file_pos, length):
             :     """Reads a PageHeader object from a file at the given 
position."""
             :     serialized_page_header = read_serialized_object(filename, 
file_pos, length)
             :     protocol = create_protocol(serialized_page_header)
             :     page_header = PageHeader()
             :     page_header.read(protocol)
             :     return page_header
             :
             :
             : def get_offset_index(filename, file_pos, length):
             :     """Reads an OffsetIndex object from a file at the given 
position."""
             :     serialized_offset_index = read_serialized_object(filename, 
file_pos, length)
             :     protocol = create_protocol(serialized_offset_index)
             :     offset_index = OffsetIndex()
             :     offset_index.read(protocol)
             :     return offset_index
nit: too much indentation



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 26 Mar 2018 14:22:28 +0000
Gerrit-HasComments: Yes

Reply via email to