[
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Deepak Majeti reassigned PARQUET-1225:
--------------------------------------
Assignee: Deepak Majeti
> NaN values may lead to incorrect filtering under certain circumstances
> ----------------------------------------------------------------------
>
> Key: PARQUET-1225
> URL: https://issues.apache.org/jira/browse/PARQUET-1225
> Project: Parquet
> Issue Type: Task
> Components: parquet-cpp
> Reporter: Zoltan Ivanfi
> Assignee: Deepak Majeti
> Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that
> *most probably* affects parquet-cpp. It is known to affect Impala and by
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the
> C++ less-than operator (<), which returns false for comparisons involving a NaN.
> This means that while gathering statistics, if a NaN is the smallest value
> encountered so far (which happens to be the case after reading the first
> value if that value is NaN), no other value can ever replace it, since < will
> always be false. On the other hand, if NaN is not the first value, it won't
> affect the min value. So the min value depends on the order of elements.
> When a reader looks for specific values while reading back the data, the NaN
> value may lead to row groups being incorrectly discarded in spite of having matching
> rows. For details, please see the Impala bug IMPALA-6527.
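
The order dependence described above can be reproduced with a minimal C++ sketch. This is not the parquet-cpp code: naive_min and nan_skipping_min are hypothetical helpers written only for illustration. The first mimics a min-statistics loop that relies on operator<; the second shows one possible mitigation (ignoring NaNs while gathering the minimum), which is an assumption and not necessarily the fix chosen for this issue.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper (not parquet-cpp code): seeds the minimum with the
// first value and replaces it only when `v < min` holds.
inline double naive_min(const std::vector<double>& values) {
  assert(!values.empty());
  double min = values.front();
  for (double v : values) {
    // Any comparison involving NaN is false, so a NaN seed is never replaced
    // and a later NaN never replaces a real minimum.
    if (v < min) min = v;
  }
  return min;
}

// Hypothetical mitigation: skip NaNs entirely while gathering the minimum.
// Returns NaN only if every input value is NaN.
inline double nan_skipping_min(const std::vector<double>& values) {
  assert(!values.empty());
  double min = std::numeric_limits<double>::quiet_NaN();
  for (double v : values) {
    if (std::isnan(v)) continue;               // ignore NaN inputs
    if (std::isnan(min) || v < min) min = v;   // first real value seeds min
  }
  return min;
}
```

With these helpers, naive_min on {NaN, 1.0, 2.0} yields NaN, while the same values ordered as {1.0, NaN, 2.0} yield 1.0. nan_skipping_min yields 1.0 for both orderings.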
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)