Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector<bool> null_pages_;
> About leaving max_value unset: we cannot really do that, because 'max_value
About the formula for prefix size: what would happen, if there is no common 
prefix? If num_values is small (like 64 in the plain encoded 1k length case), 
the result would be 0.

About the untracked memory usage: the the min/max values could be collected by 
page_stats_base_, which knows the type of the column, so it wouldn't need to 
store fixed length data as string. When a data page is over, it could write its 
min/max value to a buffer. The min/max values would still had to be converted 
to string for thrift serialization, but columns could do this one-by-one, so 
their peak heap usage wouldn't be added together.


http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/6/tests/query_test/test_parquet_page_index.py@163
PS6, Line 163:       if not null_page:
             :         page_max_value = decod
> I added the column name to this specific error message.
I was thinking about catching the exception in _validate_parquet_page_index, 
adding the column name to its message or printing it, and then re-throwing the 
AssertionError.

The same could be done for page index by using an iterator that is held in a 
member / argument, so its state is still intact in the caller function when the 
exception is caught, and using it in every loop to iterate over pages.

The iterator stuff does probably not worth the effort, but knowing the name of 
the offending column could be really helpful if the test catches an error.



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward #248
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 29 Mar 2018 19:44:57 +0000
Gerrit-HasComments: Yes

Reply via email to