> My intuition would be to keep them as separate values. If you end up with negative zeros it probably means something. But I'm not a database expert.
I strongly disagree. It's not a database thing, it's a floating point thing. If you're doing floating point arithmetic you can end up with -0 or +0 from expressions that should be equivalent. There are some applications where that bit of information is useful, for example to tell that a floating point number underflowed, and those applications may want to distinguish the two. But the IEEE floating point equality operator defines +0 == -0 for good reasons: for most applications they are the same number.

E.g. in Impala "select cast(-100.0 as float) * cast(0.0 as float);" gives you -0 and "select cast(100.0 as float) * cast(0.0 as float)" gives you 0. It's unusual to want mathematically equivalent expressions to be treated differently by a compute engine. The distinction between -0 and +0 *can* matter.

You end up in a world of pain if your equality relation and your hash function implementation are not aligned. To be concrete, in a hash table, if you have two values that hash to the same value but are not equal, that results in an unavoidable hash collision and potentially quadratic performance. If you have two values that are equal but don't hash to the same value, then you'll get inconsistent results: you'll get a match if the two values happen to land in the same bucket, and no match if they don't.

So it's really a question of how you want to define equality (and whether you want to have multiple definitions of equality for different purposes). Then your hash function needs to reflect that. I think deciding how you want to define equality first, then basing the hash function implementation on that, is likely to lead to a better outcome than arguing about the hash function implementation first.

You probably also want to think carefully about whether -0 < +0 and whether min(-0, +0) is -0, +0, or either; same with NaN. Mainly the thing is to document it and keep it consistent across implementations, but it has implications for things like min/max filtering in Parquet.
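To make the equality/hash alignment point concrete, here is a minimal Python sketch. The helper names (raw_bits, canonical_bits, keys_equal) and the choice to treat all NaNs as equal to each other are assumptions made for illustration only; they are not what Arrow, Impala, or Java actually implement. The sketch hashes on a canonicalized bit pattern so that keys that compare equal always produce the same hash input.

```python
import math
import struct

# One quiet-NaN bit pattern, picked arbitrarily for this sketch.
CANONICAL_NAN_BITS = 0x7FF8000000000000


def raw_bits(x: float) -> int:
    """Reinterpret a double as its 64-bit pattern (the 'hash the raw bytes' idea)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def canonical_bits(x: float) -> int:
    """Bit pattern after collapsing -0.0 into +0.0 and every NaN into one NaN."""
    if math.isnan(x):
        return CANONICAL_NAN_BITS
    if x == 0.0:  # true for both +0.0 and -0.0
        return 0  # bit pattern of +0.0
    return raw_bits(x)


def keys_equal(a: float, b: float) -> bool:
    """Assumed equality for a hash table: IEEE ==, plus NaN matches NaN."""
    return a == b or (math.isnan(a) and math.isnan(b))


# +0.0 and -0.0 compare equal but have different raw bits, so hashing the raw
# bytes would put equal keys in different buckets.
assert keys_equal(0.0, -0.0)
assert raw_bits(0.0) != raw_bits(-0.0)

# Hashing the canonical bits keeps hash and equality aligned.
assert canonical_bits(0.0) == canonical_bits(-0.0)

# A NaN built from a different payload still canonicalizes to the same bits.
other_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000001))[0]
assert math.isnan(other_nan)
assert canonical_bits(other_nan) == canonical_bits(float("nan"))
```

If you instead decide that -0 and +0 (or different NaN payloads) are distinct keys, then hashing the raw bits is consistent, but the equality used by the table has to be bitwise as well rather than the IEEE == operator.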
We had bugs where, for example, a single NaN could result in min/max statistics in a Parquet file that were misinterpreted by various compute engines, which then filtered incorrectly; see e.g. https://issues.apache.org/jira/browse/IMPALA-6527. There was a lot of back and forth about what the right behaviour was there, but I think the main lesson is that you need to explicitly call out stuff like this and make sure everyone is on the same page, otherwise different people will make different assumptions.

- Tim

On Tue, Feb 26, 2019 at 7:22 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> If I understand your solution case 2 there are multiple underlying bit
> values that are all interpreted as NaN
>
> On Tuesday, February 26, 2019, Ravindra Pindikura <ravin...@dremio.com>
> wrote:
>
> > > On Feb 26, 2019, at 10:32 AM, Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > >
> > > Implementing compute kernels that depend on hashing has raised a couple of
> > > edge cases that are worth discussing. In particular
> > > the following points need to be resolved (I opened a JIRA [1] to track the
> > > fixes). In particular:
> > >
> > > 1. How to handle -0.0 and 0.0?
> > >    - Option 1: Collapse to a single value (this is more inline with ieee-754
> > >      spec I believe)
> > >    - Option 2: Keep them as separate values (I believe this is how java
> > >      handles them)
> > > 2. How handle NaN?
> > >    - Option 1: Do nothing with them (multiple values of NaN might occur in
> > >      hashtables)
> > >    - Option 2: Canonicalize to a single NaN (this is what java does)
> > >
> > > I haven't investigated how DB systems handle these (if anyone knows and can
> > > chime in I would appreciate it). As a default, I think it might be nice to
> > > align the C++ implementation with the way Java handles them, but I don't
> > > have any strong opinions.
> >
> > I’m probably missing something obvious. But, why not use the raw
> > 4-byte/8-byte value underneath (treat it as uint32/uint64) for the hashing
> > ? I’m assuming that will give 1 -> option 2, and 2 -> Option 2.
> >
> > > Thanks,
> > > Micah
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-4497