Re: Inconsistent float/double sort order in spec and implementations can lead to incorrect results

Jim Apple Tue, 20 Feb 2018 09:22:02 -0800

For that predicate in particular, does Impala use stats already?

Let's say a column contains only the intuitive notion of floats: no
NaNs, no infs, no -0.0. If we are filtering for $COL != a and the
row-group stats are b <= $COL <= c, where a < b, we can know that the
whole row group can be included. The addition of NaNs doesn't change
that.


OTOH, if b <= a <= c, then we have to check the whole row group, and
the addition of NaNs doesn't change that.
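
A minimal sketch of the two cases above, assuming the min/max stats were
computed over the non-NaN values only and that NULLs are handled
elsewhere. The names prune_not_equal and Decision are made up for
illustration; this is not Impala's or parquet's actual code. The a > c
branch is my assumption by symmetry with a < b.

```python
import math
from enum import Enum


class Decision(Enum):
    ALL_ROWS_MATCH = "all rows match"
    MUST_SCAN = "must scan"


def prune_not_equal(a: float, stats_min: float, stats_max: float) -> Decision:
    """Decide how to treat a row group for the predicate `col != a`.

    Assumption: stats_min/stats_max (b and c above) ignore NaNs.
    """
    # If the constant or the stats are NaN, the stats tell us nothing.
    if math.isnan(a) or math.isnan(stats_min) or math.isnan(stats_max):
        return Decision.MUST_SCAN
    # a lies outside [b, c]: every non-NaN value differs from a, and
    # NaN != a is also true, so NaNs in the row group do not change this.
    if a < stats_min or a > stats_max:
        return Decision.ALL_ROWS_MATCH
    # b <= a <= c: some value may equal a, so the rows must be checked.
    return Decision.MUST_SCAN


# A row group with stats 1.0 <= $COL <= 2.0 that also holds a NaN.
row_group = [1.0, 2.0, float("nan")]
outside = prune_not_equal(0.5, 1.0, 2.0)
inside = prune_not_equal(1.5, 1.0, 2.0)
```

Note that a = -0.0 against a minimum of +0.0 falls into the "must scan"
branch here, since -0.0 < +0.0 is false under ordinary comparison, so the
sketch stays conservative for signed zeros.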

On Tue, Feb 20, 2018 at 9:14 AM, Alexander Behm <alex.b...@cloudera.com> wrote:
> On Mon, Feb 19, 2018 at 8:04 AM, Zoltan Ivanfi <z...@cloudera.com> wrote:
>
>> Hi,
>>
>> Tim, I added your suggestion to introduce a new ColumnOrder to PARQUET-1222
>> <https://issues.apache.org/jira/browse/PARQUET-1222> as the preferred
>> solution.
>>
>> Alex, not writing min/max if there is a NaN is indeed a feasible quick-fix,
>> but I think it would be better to just ignore NaN-s for the purposes of
>> min/max stats. For reading, we can ignore stats that contain a NaN. We also
>> shouldn't use stats when looking for a NaN. -0 and +0 will still be
>> problematic, though.
>>
>
> I don't think ignoring NaNs is correct. Consider a predicate <col> !=
> <constant> that would evaluate to true against NaN. We cannot reliably use
> stats for such a predicate.
>
>
>>
>> Jim, fmax is indeed very close to IEEE-754's maxNum, but -0 and +0 are
>> implementation-dependent, as Zoltan Borok-Nagy pointed out to me: "This
>> function is not required to be sensitive to the sign of zero, although some
>> implementations additionally enforce that if one argument is +0 and the
>> other is -0, then +0 is returned." [1]
>> [1] <http://en.cppreference.com/w/c/numeric/math/fmax>
>>
>> Br,
>>
>> Zoltan
>>
>>
>>
>> On Fri, Feb 16, 2018 at 6:57 PM Jim Apple <jbap...@cloudera.com> wrote:
>>
>> > On Fri, Feb 16, 2018 at 9:44 AM, Zoltan Borok-Nagy
>> > <borokna...@cloudera.com> wrote:
>> > > I would just like to mention that the fmax() / fmin() functions in the
>> > > C/C++ math library follow the aforementioned IEEE 754-2008 min and max
>> > > specification:
>> > > http://en.cppreference.com/w/c/numeric/math/fmax
>> > >
>> > > I think this behavior is also the most intuitive and useful with regard
>> > > to statistics. If we want to select the max value, I think it's
>> > > reasonable to ignore nulls and not-numbers.
>> >
>> > It should be noted that this is different than the total ordering
>> > predicate. With that predicate, -NaN < -inf < negative numbers < -0.0
>> > < +0.0 < positive numbers < +inf < +NaN
>> >
>> > fmax appears to be closest to IEEE-754's maxNum, but not quite
>> > matching for some corner cases (-0.0, signalling NaN), but I'm not
>> > 100% sure on that.
>> >
>>
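
For reference, the total ordering quoted above (-NaN < -inf < negative
numbers < -0.0 < +0.0 < positive numbers < +inf < +NaN) can be sketched
by comparing the IEEE-754 bit patterns of doubles. The helper names
total_order_key and from_bits are made up for illustration, and this is
only one way to get that ordering, not code from any of the
implementations discussed in this thread.

```python
import struct


def total_order_key(x: float) -> int:
    """Map a double to an int that sorts in the quoted total order."""
    # Reinterpret the 64-bit pattern of the double as a signed integer.
    bits = struct.unpack(">q", struct.pack(">d", x))[0]
    # Negative doubles sort in reverse of their magnitude bits, so flip
    # the lower 63 bits while keeping the sign.
    if bits < 0:
        bits ^= 0x7FFFFFFFFFFFFFFF
    return bits


def from_bits(pattern: int) -> float:
    """Build a double from an explicit 64-bit pattern."""
    return struct.unpack(">d", struct.pack(">Q", pattern))[0]


# Quiet NaNs with the sign bit set and cleared.
NEG_NAN = from_bits(0xFFF8000000000000)
POS_NAN = from_bits(0x7FF8000000000000)

values = [POS_NAN, 1.0, -0.0, float("-inf"), NEG_NAN, 0.0, float("inf"), -1.0]
ordered = sorted(values, key=total_order_key)
```

Unlike fmax, this key distinguishes -0.0 from +0.0 and places NaNs at
the two ends according to their sign bit instead of ignoring them.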
