Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/14264 )
Change subject: IMPALA-8498: Write column index for floating types when NaN is not present ...................................................................... Patch Set 5: (8 comments) http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@444 PS4, Line 444: > std::isnan() could be put in an UNLIKELY macro. Done http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@438 PS4, Line 438: T* val = CastValue(value); : if (current_encoding_ == parquet::Encoding::PLAIN_DICTIONARY) { : if (UNLIKELY(num_values_since_dict_size_check_ >= : DICTIONARY_DATA_PAGE_SIZE_CHECK_PERIOD)) { : num_values_since_dict_size_check_ = 0; : if (dict_encoder_->EstimatedDataEncodedSize() >= page_size_) return false; : } : ++num_values_since_dict_size_check_; : *bytes_needed = dict_encoder_->Put(*val); : // If the dictionary contains the maximum number of values, switch to plain : // encoding for the next page. The current page is full and must be written out. : if (UNLIKELY(*bytes_needed < 0)) { : next_page_encoding_ = parquet::Encoding::PLAIN; : > I have 3 minor issues about this block: I have moved the block to the end, right before updating page stats. The problem I run into with your suggestion was that I cannot use CastValue since it returns a T*, and the compiler can't deduce from std::is_floating_point that T can be passed to std::isnan. But I understand your point regard compile/runtime checks, proposed a solution with only compile time checks. It separates the double and float type checks using std::is_same. +: Added the cast to the beginning of the function. http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@452 PS4, Line 452: } > optional: a potential optimization would be to do the NaN check here only f I would skip this optimization for now, not sure if it's worth the extra work, since this is only a workaround until the aforementioned Parquet issue is solved. http://gerrit.cloudera.org:8080/#/c/14264/2/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/14264/2/tests/query_test/test_parquet_page_index.py@429 PS2, Line 429: > flake8: E501 line too long (93 > 90 characters) Done http://gerrit.cloudera.org:8080/#/c/14264/2/tests/query_test/test_parquet_page_index.py@462 PS2, Line 462: > flake8: E116 unexpected indentation (comment) Done http://gerrit.cloudera.org:8080/#/c/14264/2/tests/query_test/test_parquet_page_index.py@473 PS2, Line 473: > flake8: E116 unexpected indentation (comment) Done http://gerrit.cloudera.org:8080/#/c/14264/2/tests/query_test/test_parquet_page_index.py@483 PS2, Line 483: > flake8: E116 unexpected indentation (comment) Done http://gerrit.cloudera.org:8080/#/c/14264/4/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/14264/4/tests/query_test/test_parquet_page_index.py@461 PS4, Line 461: for col_type in [" > I am bit worried about double being untested, and it seems simple to run th Done -- To view, visit http://gerrit.cloudera.org:8080/14264 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9d367500243c8ca142a16ebfeef6c841f013434 Gerrit-Change-Number: 14264 Gerrit-PatchSet: 5 Gerrit-Owner: Norbert Luksa <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Norbert Luksa <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 03 Oct 2019 07:05:57 +0000 Gerrit-HasComments: Yes
