I hope the common case is that data files do not contain these special
float values. As the simplest solution, how about writers refrain from
populating the stats if a special value is encountered?

That fix does not preclude a more thorough solution in the future, but it
addresses the common case quickly.
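
To make the writer-side idea concrete, here is a minimal sketch in Python
(chosen only for brevity; it is not taken from any Parquet writer). The
function name float_min_max_stats is hypothetical, and it assumes that
"special value" means NaN; whether signed zeros should also suppress the
stats is left open here.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max_stats(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Return (min, max) for a chunk of floats, or None for "write no stats".

    Hypothetical helper: None tells the caller to leave the min/max
    statistics unpopulated for this chunk.
    """
    lo = None
    hi = None
    for v in values:
        if math.isnan(v):
            # Assumption: NaN is the special value. Once one is seen,
            # refrain from populating the stats at all.
            return None
        if lo is None or v < lo:
            lo = v
        if hi is None or v > hi:
            hi = v
    if lo is None:
        # Empty input: there is nothing to record.
        return None
    return lo, hi
```

A chunk such as [1.0, 3.5, -2.0] still gets (-2.0, 3.5), while any chunk
containing a NaN gets no stats, so readers simply cannot filter on it.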

For existing data files we could check the writer version and ignore filters on
float/double. I don't know whether min/max filtering is common on
float/double, but I suspect it's not.
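
The reader-side check could look roughly like the sketch below. Everything
named here is an assumption for illustration: usable_min_max is a
hypothetical helper, and writer_is_trusted is a caller-supplied predicate
over the file's created_by string, since which writer versions count as
fixed has not been decided.

```python
from typing import Callable, Optional, Tuple

# Parquet physical types whose min/max stats may be affected.
FLOAT_TYPES = {"FLOAT", "DOUBLE"}


def usable_min_max(
    physical_type: str,
    stats: Optional[Tuple[float, float]],
    created_by: str,
    writer_is_trusted: Callable[[str], bool],
) -> Optional[Tuple[float, float]]:
    """Return stats that are safe to filter on, or None to skip filtering."""
    if stats is None:
        return None
    if physical_type in FLOAT_TYPES and not writer_is_trusted(created_by):
        # Float/double stats from a writer not known to be fixed: ignore them.
        return None
    return stats
```

Stats on other column types pass through unchanged, so only float/double
filtering is lost for older files.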

On Fri, Feb 16, 2018 at 8:38 AM, Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> There is an extensibility mechanism with the ColumnOrder union - I think
> that was meant to avoid the need to add new stat fields?
>
> Given that the bug was in the Parquet spec, we'll need to make a spec
> change anyway, so we could add a new ColumnOrder - FloatingPointTotalOrder?
> at the same time as fixing the gap in the spec.
>
> It could make sense to declare that the default ordering for floats/doubles
> is not NaN-aware (i.e. the reader should assume that NaN was arbitrarily
> ordered) and readers should either implement the required logic to handle
> that correctly (I had some ideas here:
> https://issues.apache.org/jira/browse/IMPALA-6527?focusedCommentId=16366106&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16366106)
> or ignore the stats.
>
> On Fri, Feb 16, 2018 at 8:15 AM, Jim Apple <jbap...@cloudera.com> wrote:
>
> > > We could have a similar problem
> > > with not finding +0.0 values because a -0.0 is written to the max_value
> > > field by some component that considers them the same.
> >
> > My hope is that the filtering would behave sanely, since -0.0 == +0.0
> > under the real-number-inspired ordering, which is distinguished from
> > total ordering, and which is also what you get when you use the
> > default C/C++ operators <, >, <=, ==, and so on.
> >
> > You can distinguish between -0.0 and +0.0 without using total ordering
> > by taking their reciprocal: 1.0/-0.0 is -inf. There are some other
> > ways to distinguish, I suspect, but that's the simplest one I recall
> > at the moment.
> >
>
