OK, to summarize my understanding of the thoughts expressed:

1. People really shouldn't be trying to do things like grouping and joining on double-valued columns (but they do).
2. The consensus (but not 100% agreement):
   * Canonicalize NaNs and assume NaN == NaN for group by/unique kernels.
   * Assume -0.0 == 0.0.
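
To make the two conclusions above concrete, here is a minimal Python sketch of a canonicalizing key for a hash-based unique kernel. It is only an illustration under stated assumptions, not the actual kernel code: the names `CANONICAL_NAN_BITS`, `canonical_bits`, and `unique` are hypothetical, and the choice of the quiet NaN with zero payload as the canonical NaN is an assumption.

```python
import math
import struct

# Assumption: use the quiet NaN with zero payload as the single canonical NaN.
CANONICAL_NAN_BITS = 0x7FF8000000000000


def canonical_bits(x: float) -> int:
    """Return a 64-bit key such that all NaNs share one key and -0.0/+0.0 share one key."""
    if math.isnan(x):
        return CANONICAL_NAN_BITS
    if x == 0.0:
        # -0.0 == 0.0 is True under IEEE 754, so both map to the bits of +0.0.
        return 0
    # Any other value is keyed by its raw IEEE 754 bit pattern.
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def unique(values):
    """Hypothetical unique kernel: first-seen order, keyed on canonical bits."""
    seen = {}
    for v in values:
        key = canonical_bits(v)
        if key not in seen:
            # Store the canonical representative rather than the original input.
            seen[key] = struct.unpack("<d", struct.pack("<Q", key))[0]
    return list(seen.values())


result = unique([0.0, -0.0, float("nan"), -float("nan"), 1.5, 1.5])
print(len(result))
```

Because the hash key and the equality check are both derived from the same canonical bits, the equality relation and the hash function stay aligned, which is the property the thread is concerned about.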
I can update the JIRA with these conclusions unless someone strongly disagrees.

Thanks,
Micah

On Tue, Feb 26, 2019 at 11:54 AM Wes McKinney <wesmck...@gmail.com> wrote:

> In an analytics setting my prior is that -0/+0 and all types of NaNs
> should respectively be considered semantically to all be "the same
> value". It would be confusing (and likely "wrong" in a practical
> setting) to obtain two kinds of zeros as the output of an algorithm
> involving a hash table, like Unique or ValueCounts. However: hashing
> of floats should not be encouraged in general, but sometimes people
> will hash the results of some operation that happens to yield floats.
>
> On Tue, Feb 26, 2019 at 1:49 PM Antoine Pitrou <solip...@pitrou.net>
> wrote:
> >
> > On Tue, 26 Feb 2019 09:59:54 -0800
> > Tim Armstrong <tarmstr...@cloudera.com.INVALID> wrote:
> > > It's not a database thing, it's a floating point
> > > number thing. If you're doing floating point arithmetic you can end up
> > > with -0/+0 from expressions that should be equivalent.
> >
> > But we are not exactly dealing with arithmetic here... I'm not sure
> > the IEEE FP standard was designed with database joins in mind.
> >
> > Granted, float hashing and float equality may be of dubious utility.
> > I'm curious about the use cases.
> >
> > > You end up in a world of pain if your equality relation and your hash
> > > function implementation are not aligned.
> >
> > This is not what I am suggesting.
> >
> > > So it's really a question of how you want to define equality (and
> > > whether you want to have multiple definitions of equality for
> > > different purposes).
> >
> > I think this is the goal of this discussion.
> >
> > Regards
> >
> > Antoine.