Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Tim Armstrong Fri, 16 Feb 2018 08:38:47 -0800

There is an extensibility mechanism with the ColumnOrder union - I think
that was meant to avoid the need to add new stat fields?

Given that the bug was in the Parquet spec, we'll need to make a spec
change anyway, so we could add a new ColumnOrder (FloatingPointTotalOrder?)
at the same time as fixing the gap in the spec.

It could make sense to declare that the default ordering for floats/doubles
is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
ordered) and readers should either implement the required logic to handle
that correctly (I had some ideas here:
https://issues.apache.org/jira/browse/IMPALA-6527?focusedCommentId=16366106&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
or ignore the stats.
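
To make the "ignore the stats" option concrete, here is a minimal sketch of
a conservative reader-side check for an equality predicate. The names
FloatStats and MayContainEqual are made up for illustration and are not
part of any Parquet API, and this is not the logic from the JIRA comment
linked above; it only assumes that a NaN in either bound makes the stats
untrustworthy.

```cpp
#include <cmath>

// Hypothetical holder for the min/max statistics of one float/double
// column chunk. Not a real Parquet type.
struct FloatStats {
    double min_value;
    double max_value;
};

// Returns true if the chunk might contain a value equal to `target` and
// therefore has to be read; false means it can safely be skipped.
// Assumption: NaN was ordered arbitrarily by the writer, so any NaN in
// the stats (or in the predicate) means the stats are ignored.
bool MayContainEqual(const FloatStats& stats, double target) {
    if (std::isnan(stats.min_value) || std::isnan(stats.max_value)) {
        return true;  // bounds are unreliable: do not prune
    }
    if (std::isnan(target)) {
        return true;  // stats say nothing about where NaN ended up
    }
    // The default operators treat -0.0 and +0.0 as equal, so a max_value
    // of -0.0 still matches a search for +0.0.
    return stats.min_value <= target && target <= stats.max_value;
}
```

This only covers the case where NaN leaked into the bounds themselves.
Whether non-NaN bounds can be trusted when the data contains NaN depends on
how the writer computed them, which is exactly the gap in the spec.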

On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <[email protected]> wrote:

> > We could have a similar problem
> > with not finding +0.0 values because a -0.0 is written to the max_value
> > field by some component that considers them the same.
>
> My hope is that the filtering would behave sanely, since -0.0 == +0.0
> under the real-number-inspired ordering, which is distinct from the
> total ordering and is what you get when you use the default C/C++
> operators <, >, <=, ==, and so on.
>
> You can distinguish between -0.0 and +0.0 without using total ordering
> by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> ways to distinguish, I suspect, but that's the simplest one I recall
> at the moment.
>
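
For reference, the reciprocal trick quoted above can be sketched as follows,
next to std::signbit as one of the other ways to tell the two zeros apart.
The function names are made up for illustration, and the sketch assumes
IEEE 754 arithmetic (no -ffast-math or similar), so that dividing by zero
yields an infinity.

```cpp
#include <cmath>
#include <limits>

// The reciprocal trick relies on IEEE 754 semantics for division by zero.
static_assert(std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE 754 doubles");

// True only for -0.0: 1.0 / -0.0 is -inf, while 1.0 / +0.0 is +inf.
bool IsNegativeZeroByReciprocal(double x) {
    return x == 0.0 && 1.0 / x < 0.0;
}

// Same result by reading the sign bit directly instead of dividing.
bool IsNegativeZeroBySignbit(double x) {
    return x == 0.0 && std::signbit(x);
}
```

Both helpers first check x == 0.0, which is true for either zero under the
default operators, and only then look at the sign.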
