[ https://issues.apache.org/jira/browse/PARQUET-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370706#comment-16370706 ]
Zoltán Borók-Nagy commented on PARQUET-1225:
--------------------------------------------

Hi [~mdeepak], the proposed quick fix for the Impala write path is described in IMPALA-6542.

The proposed solution is very close to what is described on the [parquet-cpp mailing list|https://lists.apache.org/thread.html/2f9afec11d6dc11d0d9613a3bfb64c0b32dad8ebfdc30fa4252a8ec1@%3Cdev.parquet.apache.org%3E], i.e. basically ignore NaNs. The two proposals differ only slightly, in the case where all the values are NaN. I don't have a strong opinion on this very special edge case, but I think the parquet-cpp and Impala behavior should be aligned.

> NaN values may lead to incorrect filtering under certain circumstances
> ----------------------------------------------------------------------
>
>                 Key: PARQUET-1225
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1225
>             Project: Parquet
>          Issue Type: Task
>          Components: parquet-cpp
>            Reporter: Zoltan Ivanfi
>            Assignee: Deepak Majeti
>            Priority: Major
>
> _This JIRA describes a generic problem with floating point comparisons that
> *most probably* affects parquet-cpp. It is known to affect Impala and by
> taking a quick look at the parquet-cpp code it seems to affect parquet-cpp as
> well, but it has not yet been confirmed in practice._
> For comparing float and double values for min/max stats, parquet-cpp uses the
> C++ less-than operator (<) that returns false for comparisons involving a
> NaN. This means that while gathering statistics, if a NaN is the smallest
> value encountered so far (which happens to be the case after reading the
> first value if that value is NaN), no other value can ever replace it, since
> < will always be false. On the other hand, if NaN is not the first value, it
> won't affect the min value. So the min value depends on the order of elements.
> If looking for specific values while reading back the data, the NaN value may
> lead to row groups being incorrectly discarded in spite of having matching
> rows. For details, please see the Impala bug IMPALA-6527.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
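
The order dependence described in the issue can be reproduced with a few lines of standalone C++. The sketch below is not taken from parquet-cpp or Impala: `NaiveMin` and `MinIgnoringNaN` are hypothetical names, and returning an empty `std::optional` when every value is NaN is only one possible choice for that edge case, not necessarily what either proposal does.

```cpp
#include <cassert>
#include <cmath>
#include <optional>
#include <vector>

// Hypothetical helper: running minimum seeded with the first value and
// updated with operator<, as the issue describes. Requires a non-empty input.
double NaiveMin(const std::vector<double>& values) {
  assert(!values.empty());
  double min = values.front();
  for (double v : values) {
    // Any comparison involving NaN is false, so a NaN seed is never replaced
    // and a later NaN never replaces a real minimum.
    if (v < min) min = v;
  }
  return min;
}

// Hypothetical helper: minimum that skips NaNs. The all-NaN (or empty) case
// yields std::nullopt here; this is an assumption made for illustration.
std::optional<double> MinIgnoringNaN(const std::vector<double>& values) {
  std::optional<double> min;
  for (double v : values) {
    if (std::isnan(v)) continue;
    if (!min || v < *min) min = v;
  }
  return min;
}
```

With this sketch, `NaiveMin` returns NaN for the input `{NaN, 1.0, 2.0}` but `1.0` for `{1.0, NaN, 2.0}`, while `MinIgnoringNaN` gives `1.0` for both orderings.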