If a floating point column does not have NaN as a lower_bound or
upper_bound, must it contain no NaNs?

This question came up in Parquet, in
https://issues.apache.org/jira/browse/PARQUET-1222.

One reasonable choice would be to specify the use of the IEEE-754
totalOrder predicate:

https://en.wikipedia.org/wiki/IEEE_754#Total-ordering_predicate

Under totalOrder, if neither bound is NaN, then the column does not
contain NaN. After that it gets more complex: NaNs are signed, and a
negative NaN sorts below everything else, so an upper_bound that is a
negative NaN would mean the column contains ONLY NaNs. To add further
complexity, for the purpose of skipping files, I suppose the compute
engines would have to be using totalOrder as well, not one of the usual
comparators like <=.
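
As a concrete (and entirely unofficial) sketch, here's one way totalOrder
could be implemented for doubles in Java, using the raw bits; the names
here are mine, not from any spec:

    // Order produced: -NaN < -Inf < negative finites < -0 < +0
    //                 < positive finites < +Inf < +NaN.
    public class TotalOrder {
      // Map raw bits to a signed long whose natural ordering matches
      // totalOrder: positive-sign values already sort by bit pattern;
      // negative-sign values get their non-sign bits flipped so that
      // larger magnitudes (and negative NaNs) sort lower.
      static long key(double d) {
        long bits = Double.doubleToRawLongBits(d);
        return bits >= 0 ? bits : bits ^ 0x7FFFFFFFFFFFFFFFL;
      }

      static boolean totalOrderLessEqual(double a, double b) {
        return key(a) <= key(b);
      }

      public static void main(String[] args) {
        System.out.println(totalOrderLessEqual(-0.0, 0.0));       // true
        System.out.println(totalOrderLessEqual(0.0, -0.0));       // false
        System.out.println(totalOrderLessEqual(Double.NaN, 1.0)); // false: +NaN sorts above everything
      }
    }

Under that predicate, a non-NaN upper_bound rules out positive NaNs and a
non-NaN lower_bound rules out negative NaNs, which is where "neither bound
is NaN implies no NaNs" comes from.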

Another possibility is to count NaNs the way NULLs are counted, with a
nan_value_counts field, and insist that lower and upper bounds be real
numbers or infinities. I'm not sure then how a lower_bound would be set
for a column with no non-NaN values. Maybe it would just be left out of
the map.
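
For what it's worth, a minimal sketch of stats collection under that
scheme might look like this (nan_value_counts and these field names are
just my guesses, nothing that exists today):

    // Hypothetical per-column accumulator: NaNs are counted separately
    // and never become a bound, the same way NULLs are handled.
    public class FloatColumnStats {
      long nanValueCount = 0;
      Double lowerBound = null; // stays null if every value was NaN
      Double upperBound = null;

      void add(double v) {
        if (Double.isNaN(v)) {
          nanValueCount++; // counted, but excluded from the bounds
          return;
        }
        // Plain < and > here gloss over the -0/+0 distinction discussed below.
        if (lowerBound == null || v < lowerBound) lowerBound = v;
        if (upperBound == null || v > upperBound) upperBound = v;
      }
    }

If lowerBound and upperBound are still null at the end, the writer would
just omit the column from the bounds maps, which is the "left out of the
map" case above.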

One other thing I'll note about floating-point weirdness: 0 can be signed.
However, -0 compares equal to +0, even though the two can be distinguished
by some operations.

So 1.0/0.0 = inf, but 1.0/-0.0 = -inf. Also, -0 is less than +0 in
totalOrder, so compute engines pruning files with totalOrder would need
lower_bound to respect the distinction between -0 and +0. Additionally, -0
and +0 have different bit patterns, which means that the hashes of -0 and
+0 are likely different, given the hash function the spec defines,
hashLong(doubleToRawLongBits(v)), even though the floating-point values are
"equal" for some definition of equal.

I'm not sure how important this last one is, since the spec says "floating
point types are not valid source values for partitioning", and I'm still
working on parsing the spec to understand why hashing needs to be defined
at all for floating-point values if they aren't valid source values for
partitioning.

Thanks!
Jim
