> My intuition would be to keep them as separate values.  If you end up
with negative zeros it probably means something.  But I'm not a database
expert.

I strongly disagree. It's not a database thing, it's a floating point
number thing. If you're doing floating point arithmetic you can end up with
-0/+0 from expressions that should be equivalent. There are some
applications where the bit of information is useful, for example if a
floating point number underflowed. Then applications may want to
distinguish the two. But the equality operator in IEEE 754 floating point
defines +0 == -0 for good reasons: for most applications they are the same
number.

E.g. in Impala "select cast(-100.0 as float) * cast(0.0 as float);" gives
you -0 and "select cast(100.0 as float) * cast(0.0 as float)" gives you 0.
It's unusual to want mathematically equivalent expressions to be treated
differently by a compute engine. The distinction between -0 and +0 *can*
matter.
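
A small Python sketch of the same effect. Python floats are IEEE 754
doubles rather than Impala's float, and bits() is a helper I made up for
illustration, so this only shows the arithmetic, not Impala's behaviour:

```python
import math
import struct

def bits(x: float) -> int:
    """Return the raw 64-bit pattern of an IEEE 754 double (illustrative helper)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

neg = -100.0 * 0.0   # the product's sign is negative, so this is -0.0
pos = 100.0 * 0.0    # +0.0

print(neg == pos)                       # True: the two zeros compare equal
print(hex(bits(neg)), hex(bits(pos)))   # 0x8000000000000000 0x0: only the sign bit differs
print(math.copysign(1.0, neg))          # -1.0: the sign is still observable
```

So the two results are equal under ==, yet anything that looks at the raw
bits will see two different values.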

You end up in a world of pain if your equality relation and your hash
function implementation are not aligned. To be concrete, in a hash table,
if you have two values that hash to the same value but are not equal, that
results in an unavoidable hash collision and potentially quadratic
performance. If you have two values that are equal, but don't hash to the
same value, then you'll get inconsistent results because you'll get a match
if two values happen to hash to the same bucket, but not get a match if
they don't.
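
To make the second failure mode concrete, here is a toy sketch. It
assumes a hash taken from the raw bits combined with IEEE equality;
raw_bits_hash() and contains() are hypothetical names, not anything from
Arrow or Impala:

```python
import struct

def raw_bits_hash(x: float) -> int:
    """Hash a double by its raw 64-bit pattern (hypothetical helper)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def contains(values, probe, n_buckets):
    """Toy chained hash table: the hash picks the bucket, == decides the match."""
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        buckets[raw_bits_hash(v) % n_buckets].append(v)
    return any(v == probe for v in buckets[raw_bits_hash(probe) % n_buckets])

# 0.0 == -0.0 is True, but their bit patterns (0 and 2**63) differ.
print(contains([0.0], -0.0, n_buckets=8))  # True: both land in bucket 0
print(contains([0.0], -0.0, n_buckets=7))  # False: buckets 0 and 1
```

The answer to "is -0.0 in the table?" changes with the bucket count, which
is exactly the kind of inconsistency described above.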

So it's really a question of how you want to define equality (and whether
you want to have multiple definitions of equality for different purposes).
Then your hash function needs to reflect that. I think considering how you
want to define equality, then basing the hash function implementation on
that is likely to lead to a better outcome than arguing about the hash
function implementation first.
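
One way to sketch "equality first, hash second" is to pick a canonical
representative for each equality class and derive both operations from
it. The policy below (collapse the zeros, collapse all NaNs) is only one
of the options under discussion, and canonical_key(), values_equal() and
value_hash() are names invented for this example:

```python
import math
import struct

def canonical_key(x: float) -> bytes:
    """Map a double to one representative per equality class (assumed policy)."""
    if x == 0.0:
        x = 0.0          # -0.0 and +0.0 collapse to +0.0
    elif math.isnan(x):
        x = math.nan     # every NaN collapses to a single NaN
    return struct.pack("<d", x)

def values_equal(a: float, b: float) -> bool:
    """Equality as defined by the policy above, not by IEEE ==."""
    return canonical_key(a) == canonical_key(b)

def value_hash(a: float) -> int:
    """Hash derived from the same key, so equal values always hash alike."""
    return hash(canonical_key(a))

print(values_equal(0.0, -0.0), value_hash(0.0) == value_hash(-0.0))
print(values_equal(math.nan, -math.nan))
```

Because both functions go through the same key, they cannot drift apart.
Choosing a different policy (e.g. keeping -0 and +0 distinct) means
changing canonical_key() only.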

You probably also want to think carefully about whether -0 < +0 and whether
min(-0, +0) is -0, +0, or either; same with NaN. Mainly the thing is to
document it and keep it consistent across implementations, but it has
implications for things like min/max filtering in Parquet. We had bugs
where, for example, a single NaN could result in min/max statistics in a
Parquet file that were misinterpreted by various compute engines, which
then incorrectly filtered out data, e.g.
https://issues.apache.org/jira/browse/IMPALA-6527. There was a lot of back
and forth about what the right behaviour was there, but I think the main
lesson is that you need to explicitly call out stuff like this and make
sure everyone is on the same page, otherwise different people will make
different assumptions.
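
As a toy illustration of how a single NaN can poison min/max statistics
(this is not a reproduction of IMPALA-6527 or of any real Parquet writer;
naive_min_max() and may_contain() are made-up helpers):

```python
import math

def naive_min_max(values):
    """Compute min/max with plain < and >, with no special handling of NaN."""
    lo = hi = values[0]
    for v in values[1:]:
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi

def may_contain(stats, target):
    """Reader-side filter: keep a chunk only if target lies within [min, max]."""
    lo, hi = stats
    return lo <= target <= hi

chunk_a = [math.nan, 1.0, 5.0]   # NaN first: it sticks as both min and max
chunk_b = [1.0, math.nan, 5.0]   # NaN later: it is silently ignored

print(naive_min_max(chunk_a))                    # (nan, nan)
print(naive_min_max(chunk_b))                    # (1.0, 5.0)
print(may_contain(naive_min_max(chunk_a), 5.0))  # False, although 5.0 is in chunk_a
```

The same data in a different order yields different statistics, and a
reader that trusts them skips a chunk that actually contains the value.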

- Tim

On Tue, Feb 26, 2019 at 7:22 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> If I understand your solution case 2 there are multiple underlying bit
> values that are all interpreted as NaN
>
> On Tuesday, February 26, 2019, Ravindra Pindikura <ravin...@dremio.com>
> wrote:
>
> >
> >
> > > On Feb 26, 2019, at 10:32 AM, Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> > >
> > > Implementing compute kernels that depend on hashing has raised a couple
> > of
> > > edge cases that are worth discussing.  In particular
> > > the following points need to be resolved (I opened a JIRA [1] to track
> > the
> > > fixes).  In particular:
> > >
> > > 1. How to handle -0.0 and 0.0?
> > > -  Option 1: Collapse to a single value (this is more inline with
> > ieee-754
> > > spec I believe)
> > > - Option 2: Keep them as separate values (I believe this is how java
> > > handles them)
> > > 2. How handle NaN?
> > > - Option 1: Do nothing with them (multiple values of NaN might occur in
> > > hashtables)
> > > - Option 2: Canonicalize to a single NaN (this is what java does)
> > >
> > > I haven't investigated how DB systems handle these (if anyone knows and
> > can
> > > chime in I would appreciate it).  As a default, I think it might be
> nice
> > to
> > > align the C++ implementation with the way Java handles them, but I
> don't
> > > have any strong opinions.
> >
> > I’m probably missing something obvious. But, why not use the raw
> > 4-byte/8-byte value underneath (treat it as uint32/uint64) for the
> hashing
> > ? I’m assuming that will give 1 -> option 2, and 2 -> Option 2.
> >
> >
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-4497
> >
> >
>