Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/14264 )
Change subject: IMPALA-8498: Write column index for floating types when NaN is not present ...................................................................... Patch Set 4: (3 comments) Sorry for joining late when the tests are already running. http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@438 PS4, Line 438: // IMPALA-7304: Don't write column index for floating-point columns until : // PARQUET-1222 is resolved. This is modified by: : // IMPALA-8498: Write column index for floating types when NaN is not present : if (std::is_floating_point<T>::value) { : if (type().type == TYPE_FLOAT) { : float* v = static_cast<float*>(value); : if (std::isnan(*v)) valid_column_index_ = false; : } else if (type().type == TYPE_DOUBLE) { : double* v = static_cast<double*>(value); : if (std::isnan(*v)) valid_column_index_ = false; : } else { : DCHECK(false); : } : } I have 3 minor issues about this block: - as it is related to stats, it seems more logical to me to move it to the end of this function, near page_stats_->Update(*CastValue(value)); - as T contains the info whether this a FLOAT or DOUBLE, I think this could simplified to if (std::is_floating_point<T>::value && std::isnan(*CastValue(value))) valid_column_index_ = false; - if (type().type == TYPE_FLOAT) seems like a runtime check and the change above would turn it to a compile time check + as we use CastValue at several places, I think it would be better to add T* v = CastValue(value); to the beginning of the function. http://gerrit.cloudera.org:8080/#/c/14264/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@452 PS4, Line 452: if (current_encoding_ == parquet::Encoding::PLAIN_DICTIONARY) optional: a potential optimization would be to do the NaN check here only for PLAIN pages, and add a similar check to dictionary finalization for DICTIONARY pages. With tests this may be too much work, but I think it worth a TODO. http://gerrit.cloudera.org:8080/#/c/14264/4/tests/query_test/test_parquet_page_index.py File tests/query_test/test_parquet_page_index.py: http://gerrit.cloudera.org:8080/#/c/14264/4/tests/query_test/test_parquet_page_index.py@461 PS4, Line 461: col_type = "FLOAT" I am bit worried about double being untested, and it seems simple to run the same tests with col_type = "DOUBLE" for both table creation and to be substituted in CAST expressions. -- To view, visit http://gerrit.cloudera.org:8080/14264 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic9d367500243c8ca142a16ebfeef6c841f013434 Gerrit-Change-Number: 14264 Gerrit-PatchSet: 4 Gerrit-Owner: Norbert Luksa <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Norbert Luksa <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 26 Sep 2019 15:27:46 +0000 Gerrit-HasComments: Yes
