Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14264 )

Change subject: IMPALA-8498: Write column index for floating types when NaN is 
not present
......................................................................


Patch Set 4:

(3 comments)

Sorry for joining late when the tests are already running.

http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@438
PS4, Line 438:     // IMPALA-7304: Don't write column index for floating-point 
columns until
             :     // PARQUET-1222 is resolved. This is modified by:
             :     // IMPALA-8498: Write column index for floating types when 
NaN is not present
             :     if (std::is_floating_point<T>::value) {
             :       if (type().type == TYPE_FLOAT) {
             :         float* v = static_cast<float*>(value);
             :         if (std::isnan(*v)) valid_column_index_ = false;
             :       } else if (type().type == TYPE_DOUBLE) {
             :         double* v = static_cast<double*>(value);
             :         if (std::isnan(*v)) valid_column_index_ = false;
             :       } else {
             :         DCHECK(false);
             :       }
             :     }
I have 3 minor issues about this block:

- as it is related to stats, it seems more logical to me to move it to the end 
of this function, near page_stats_->Update(*CastValue(value));

- as T contains the info whether this a FLOAT or DOUBLE, I think this could 
simplified to if (std::is_floating_point<T>::value && 
std::isnan(*CastValue(value))) valid_column_index_ = false;

- if (type().type == TYPE_FLOAT) seems like a runtime check and the change 
above would turn it to a compile time check

+ as we use CastValue at several places, I think it would be better to add T* v 
= CastValue(value); to the beginning of the  function.


http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@452
PS4, Line 452: if (current_encoding_ == parquet::Encoding::PLAIN_DICTIONARY)
optional: a potential optimization would be to do the NaN check here only for 
PLAIN pages, and add a similar check to dictionary finalization for DICTIONARY 
pages. With tests this may be too much work, but I think it worth a TODO.


http://gerrit.cloudera.org:8080/#/c/14264/4/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/14264/4/tests/query_test/test_parquet_page_index.py@461
PS4, Line 461: col_type = "FLOAT"
I am bit worried about double being untested, and it seems simple to run the 
same tests with col_type = "DOUBLE" for both table creation and to be 
substituted in CAST expressions.



--
To view, visit http://gerrit.cloudera.org:8080/14264
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic9d367500243c8ca142a16ebfeef6c841f013434
Gerrit-Change-Number: 14264
Gerrit-PatchSet: 4
Gerrit-Owner: Norbert Luksa <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Norbert Luksa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 26 Sep 2019 15:27:46 +0000
Gerrit-HasComments: Yes

Reply via email to