[ 
https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370546#comment-16370546
 ] 

Zoltan Ivanfi commented on PARQUET-1225:
----------------------------------------

Please note that there is already a [review request for an Impala 
workaround|https://gerrit.cloudera.org/#/c/9358/]. I think it would be 
beneficial to agree on a common approach so that Impala and parquet-cpp 
handle the problem consistently.

> NaN values may lead to incorrect filtering under certain circumstances
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-1225
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1225
>             Project: Parquet
>          Issue Type: Task
>          Components: parquet-cpp
>            Reporter: Zoltan Ivanfi
>            Assignee: Deepak Majeti
>            Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that 
> *most probably* affects parquet-cpp. It is known to affect Impala and by 
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as 
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the 
> C++ less-than operator (<) that returns false for comparisons involving a 
> NaN. This means that while gathering statistics, if a NaN is the smallest 
> value encountered so far (which happens to be the case after reading the 
> first value if that value is NaN), no other value can ever replace it, since 
> < will always be false. On the other hand, if NaN is not the first value, it 
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may 
> lead to row groups being incorrectly discarded in spite of having matching 
> rows. For details, please see the Impala bug IMPALA-6527.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
