> On Feb 26, 2019, at 10:32 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Implementing compute kernels that depend on hashing has raised a couple of
> edge cases that are worth discussing. In particular, the following points
> need to be resolved (I opened a JIRA [1] to track the fixes):
>
> 1. How to handle -0.0 and 0.0?
> - Option 1: Collapse to a single value (this is more in line with the
> IEEE 754 spec, I believe)
> - Option 2: Keep them as separate values (I believe this is how Java
> handles them)
> 2. How to handle NaN?
> - Option 1: Do nothing with them (multiple values of NaN might occur in
> hashtables)
> - Option 2: Canonicalize to a single NaN (this is what Java does)
>
> I haven't investigated how DB systems handle these (if anyone knows and can
> chime in I would appreciate it). As a default, I think it might be nice to
> align the C++ implementation with the way Java handles them, but I don't
> have any strong opinions.
I’m probably missing something obvious, but why not hash the raw 4-byte/8-byte
value underneath (treating it as uint32/uint64)? I’m assuming
that will give 1 -> Option 2, and 2 -> Option 2.
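
A minimal sketch of the raw-bits idea, assuming IEEE 754 binary64 doubles. The names RawBitsKey, CanonicalKey, and NaNWithPayload are hypothetical helpers made up for illustration, not Arrow APIs. It may help to check which option each approach actually lands on:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: reinterpret the 8 bytes of a double as a uint64 key.
// memcpy is used to avoid undefined behavior from pointer type punning.
inline std::uint64_t RawBitsKey(double v) {
  std::uint64_t bits;
  std::memcpy(&bits, &v, sizeof(bits));
  return bits;
}

// Hypothetical helper: like RawBitsKey, but folds -0.0 into 0.0 and every
// NaN bit pattern into one quiet NaN before taking the bits.
inline std::uint64_t CanonicalKey(double v) {
  if (std::isnan(v)) {
    return RawBitsKey(std::numeric_limits<double>::quiet_NaN());
  }
  if (v == 0.0) {  // true for both 0.0 and -0.0
    return RawBitsKey(0.0);
  }
  return RawBitsKey(v);
}

// Hypothetical helper: build a quiet NaN carrying a chosen payload, to mimic
// NaNs that arrive with different bit patterns.
inline double NaNWithPayload(std::uint64_t payload) {
  std::uint64_t bits =
      0x7ff8000000000000ULL | (payload & 0x0007ffffffffffffULL);
  double v;
  std::memcpy(&v, &bits, sizeof(v));
  return v;
}
```

With these, RawBitsKey(0.0) and RawBitsKey(-0.0) differ, so raw bits keep the two zeros separate. Two NaNs share a raw key only when their bit patterns match, so NaNs with different payloads would still land in different hash slots unless something like CanonicalKey is applied first. For comparison, Java's Double.equals and Double.hashCode are defined in terms of doubleToLongBits, which keeps 0.0 and -0.0 distinct and collapses all NaNs to one value.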
>
> Thanks,
> Micah
>
> [1] https://issues.apache.org/jira/browse/ARROW-4497