GitHub user JoshRosen reopened a pull request:
https://github.com/apache/spark/pull/7194
[SPARK-8797] [WIP] Fix comparison of NaN values in Spark SQL
This patch addresses an issue where queries that sorted float or double
columns containing NaN values could fail with "Comparison method violates its
general contract!" errors from TimSort. The root of this problem is that
`NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`,
so a comparator built on these operators does not define a consistent order.
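To make the failure mode concrete, here is a minimal Java sketch. The class and the `NAIVE` comparator are hypothetical and written only for illustration; they are not taken from Spark's code. The comparator is built directly on the primitive `<` and `>` operators, so any comparison involving NaN falls through to "equal", which breaks the `Comparator` contract that TimSort relies on.

```java
import java.util.Comparator;

final class NaiveNaNComparison {
    // Hypothetical comparator (an assumption for illustration, not Spark code):
    // it trusts the primitive operators, which all return false for NaN.
    static final Comparator<Double> NAIVE = (a, b) -> {
        double x = a;
        double y = b;
        if (x < y) return -1;
        if (x > y) return 1;
        return 0; // also reached whenever either operand is NaN
    };

    public static void main(String[] args) {
        double nan = Double.NaN;
        System.out.println(nan > 1.0);  // false
        System.out.println(nan == 1.0); // false
        System.out.println(nan < 1.0);  // false

        // 1.0 "equals" NaN and NaN "equals" 2.0, yet 1.0 < 2.0:
        // equality is not transitive, so the contract is violated.
        System.out.println(NAIVE.compare(1.0, nan)); // 0
        System.out.println(NAIVE.compare(nan, 2.0)); // 0
        System.out.println(NAIVE.compare(1.0, 2.0)); // -1
    }
}
```

Whether a given sort actually throws depends on the input length and on where the NaN values sit, so this sketch only demonstrates the broken contract rather than reproducing the exception.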
This is a tricky problem in general (see
https://stackoverflow.com/questions/23564120 for some discussion) and there
doesn't seem to be a standard solution. One approach would be to convert NaN
values to `null`, as is proposed in
[SPARK-6573](https://issues.apache.org/jira/browse/SPARK-6573), but this
approach would require `NaN` conversion calls to be added wherever we perform
floating-point arithmetic that could produce NaN values.
The approach taken in this patch is to arbitrarily define `NaN` to be less
than any non-null floating point value, including negative infinity. If
desired, we could make this behavior configurable via a SQLConf setting.
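The ordering described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the code in this patch: `compareNaNFirst` is a hypothetical helper name, and it delegates non-NaN comparisons to `Double.compare`. Note that `Double.compare` on its own also gives a total order but places NaN above positive infinity, which is the opposite of the choice proposed here.

```java
import java.util.Arrays;

final class NaNFirstOrdering {
    // Hypothetical helper (an assumption for illustration): NaN sorts below
    // every other double, including negative infinity, and equal to itself.
    static int compareNaNFirst(double x, double y) {
        boolean xIsNaN = Double.isNaN(x);
        boolean yIsNaN = Double.isNaN(y);
        if (xIsNaN && yIsNaN) return 0;
        if (xIsNaN) return -1;
        if (yIsNaN) return 1;
        // Neither operand is NaN, so Double.compare gives the usual numeric
        // order (it additionally orders -0.0 before 0.0).
        return Double.compare(x, y);
    }

    public static void main(String[] args) {
        Double[] values = {
            1.5, Double.NaN, Double.NEGATIVE_INFINITY,
            -3.0, Double.NaN, Double.POSITIVE_INFINITY
        };
        Arrays.sort(values, NaNFirstOrdering::compareNaNFirst);
        // prints [NaN, NaN, -Infinity, -3.0, 1.5, Infinity]
        System.out.println(Arrays.toString(values));
    }
}
```

Because every pair of values now has a consistent answer, the comparator satisfies the contract that TimSort checks for.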
Note that sorting is a building block for SQL operations that use clustered
inputs, such as distinct or group by; for these cases, the total ordering of
keys doesn't matter but we do need to ensure that values for the same key are
clustered together. I haven't tested all grouping operations with `NaN` values
yet, so there could still be latent bugs there (for example, `NaN == NaN` is
false according to the default Double equality comparison, so it's very likely
that COUNT DISTINCT operations on float columns containing NaN could return
unexpected results).
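The equality concern can be illustrated outside of Spark with a small Java sketch. Both methods below are hypothetical and only show JVM semantics; they say nothing about how Spark's aggregation operators are implemented. A distinct count based on the primitive `==` operator treats every NaN as a new value, while one based on boxed `Double` equality (which compares bit patterns via `doubleToLongBits`) treats all NaN values as one key.

```java
import java.util.HashSet;
import java.util.Set;

final class NaNDistinctSketch {
    // Hypothetical distinct count using primitive ==; since NaN == NaN is
    // false, each NaN is counted as a separate value.
    static int distinctWithPrimitiveEquals(double[] values) {
        int distinct = 0;
        for (int i = 0; i < values.length; i++) {
            boolean seen = false;
            for (int j = 0; j < i; j++) {
                if (values[j] == values[i]) {
                    seen = true;
                    break;
                }
            }
            if (!seen) distinct++;
        }
        return distinct;
    }

    // Hypothetical distinct count using boxed Double.equals/hashCode, under
    // which NaN is equal to NaN.
    static int distinctWithBoxedEquals(double[] values) {
        Set<Double> seen = new HashSet<>();
        for (double v : values) {
            seen.add(v);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        double[] values = {1.0, Double.NaN, Double.NaN, 1.0};
        System.out.println(distinctWithPrimitiveEquals(values)); // 3
        System.out.println(distinctWithBoxedEquals(values));     // 2
    }
}
```

Which of these behaviors a grouping operator exhibits depends on the equality it uses for keys, which is why the cases mentioned above still need testing.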
I've marked this as `[WIP]` because it's blocked on merging #7176 (and
possibly #7179) and because we need to figure out whether these are the right
semantics for NaN ordering.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark nan
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7194.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7194
----
commit d2b4a4a9a2139b1a6c2be5d1f1aa3d98a6c9ed99
Author: Josh Rosen
Date: 2015-07-02T03:18:05Z
Add random data generator test utilities to Spark SQL.
commit ab76cbd89bf800d590b7833f5a25c62df4ec2a95
Author: Josh Rosen
Date: 2015-07-02T04:37:38Z
Move code to Catalyst package.
commit 5acdd5ccf36487ba49815e8e0429f4c99558d427
Author: Josh Rosen
Date: 2015-07-02T05:15:13Z
Infinity and NaN are interesting.
commit b55875a05e4805cfdf2c3468a6cd50eec6a30578
Author: Josh Rosen
Date: 2015-07-02T05:23:55Z
Generate doubles and floats over entire possible range.
commit 7d5c13ea39cc0b811cc57b58b4214395026b1432
Author: Josh Rosen
Date: 2015-07-02T05:40:55Z
Add regression test for SPARK-8782 (ORDER BY NULL)
commit e7dc4fbb7c9e441c4367af7680c3acb42440ef33
Author: Josh Rosen
Date: 2015-07-02T06:15:49Z
Add very generic test for ordering
commit f9efbb5f317d28f8d38e1de9943fa9f976e8b5e5
Author: Josh Rosen
Date: 2015-07-02T06:17:28Z
Fix ORDER BY NULL
commit 13fc06a6e339eda0bb1b775c64fa6c1d78ba19bb
Author: Josh Rosen
Date: 2015-07-02T16:42:13Z
Add regression test for NaN sorting issue
commit 9bf195a716a2191e621f7aefba3db329aa7656e4
Author: Josh Rosen
Date: 2015-07-02T16:43:50Z
Re-enable NaNs in CodeGenerationSuite to produce more regression tests
commit 630ebc5756de8db0fe53e820ea70403c6d244ce3
Author: Josh Rosen
Date: 2015-07-02T17:41:58Z
Specify an ordering for NaN values.
----