[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371550#comment-16371550
 ] 

ASF GitHub Bot commented on PARQUET-1225:
-----------------------------------------

wesm commented on a change in pull request #444: [WIP] PARQUET-1225: NaN values 
may lead to incorrect filtering under certai…
URL: https://github.com/apache/parquet-cpp/pull/444#discussion_r169674386
 
 

 ##########
 File path: src/parquet/statistics.cc
 ##########
 @@ -96,6 +96,67 @@ void TypedRowGroupStatistics<DType>::Reset() {
   has_min_max_ = false;
 }
 
+template <typename T>
+inline int getValueBeginOffset(const T* values, int64_t count) {
+  return 0;
+}
+
+template <typename T>
+inline int getValueEndOffset(const T* values, int64_t count) {
+  return count;
+}
+
+template <typename T>
+inline bool notNaN (const T* value) {
+  return true;
+}
+
+template <>
+inline int getValueBeginOffset<float>(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+     if (!std::isnan(values[i])) return i;
+  }
+  return count;
+}
+
+template <>
+inline int getValueEndOffset<float>(const float* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+     if (!std::isnan(values[i])) return (i + 1);
+  }
+  return count;
+}
+
+template <>
+inline bool notNaN<float>(const float* value) {
+  return !std::isnan(*value);
+}
+
+template <>
+inline int getValueBeginOffset<double>(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = 0; i < count; i++) {
+     if (!std::isnan(values[i])) return i;
+  }
+  return 0;
+}
+
+template <>
+inline int getValueEndOffset<double>(const double* values, int64_t count) {
+  // Skip NaNs
+  for (int64_t i = (count - 1); i > 0; i--) {
+     if (!std::isnan(values[i])) return (i + 1);
 
 Review comment:
   braces

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> NaN values may lead to incorrect filtering under certain circumstances
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-1225
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1225
>             Project: Parquet
>          Issue Type: Task
>          Components: parquet-cpp
>            Reporter: Zoltan Ivanfi
>            Assignee: Deepak Majeti
>            Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while gathering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.
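
The order dependence described in the issue can be reproduced with a small
standalone sketch. This is not the parquet-cpp code: NaiveMin and
MinIgnoringNaN are hypothetical helpers written for illustration, and the
second one is only one possible way to keep NaN out of the minimum.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: running minimum seeded with the first element and
// updated with operator<, the pattern the issue describes.
double NaiveMin(const std::vector<double>& values) {
  assert(!values.empty());
  double min = values[0];
  for (std::size_t i = 1; i < values.size(); ++i) {
    // Any comparison involving NaN is false, so a NaN seed is never replaced.
    if (values[i] < min) {
      min = values[i];
    }
  }
  return min;
}

// Hypothetical alternative: skip NaNs while scanning. Returns NaN only when
// the input contains no non-NaN value.
double MinIgnoringNaN(const std::vector<double>& values) {
  double min = std::numeric_limits<double>::quiet_NaN();
  for (double v : values) {
    if (std::isnan(v)) {
      continue;
    }
    if (std::isnan(min) || v < min) {
      min = v;
    }
  }
  return min;
}
```

With the input {NaN, 1.0, 2.0} NaiveMin returns NaN, while with
{1.0, NaN, 2.0} it returns 1.0. MinIgnoringNaN returns 1.0 for both orderings.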



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
