Peter Ebert has posted comments on this change.
Change subject: KUDU-1386 NaN float and double values are not handled correctly
Patch Set 1:
Line 53: NONCOMPARABLE = 12
> > I'm not following, why would you want NaNs to show up in a per-block min/
Hey guys, I was talking to Will about this and wanted to put in unsolicited my
two cents if I could. Feel free to tell me I'm out of my element :)
I like how Impala has treated NaN where any comparison returns false. From a
mathematical standpoint I think we could all agree, this makes the most sense,
if you were writing a query to select all values >1 you would really scratch
your head when NaNs showed up in your result set (0.0/0.0 > 1). You would then
spend some time looking at the documentation to figure out why they were there,
if not just submit a support ticket with Cloudera (like some of my clients did
with doubles showing rounding errors XD). I could really see people/clients
wanting NaN to be treated more like NULL.
At the same time Impala has another precedent I like: NULL is considered
greater than all other values for sorting purposes. Would it be possible to
have our cake and eat it too? Where we sorted NaN for index purposes as
greater than all values, just like Impala NULLs, but never evaluated NaN = NaN
as true or returned NaNs where x > 1 (or x < 1)?
I'm not as familiar with the implementation of the indexing or operators and
what kind of complexity this would add, I'm sure there would be some, but I
think the complexity is worth the cost for being consistent with Impala, IEEE,
what I would consider mathematically consistent, and with the behavior average
users will expect.
If I were guessing how to implement, for doubles it might be as simple as when
someone requests x > 1, that we always add NaN > x > 1. If impala asks for
is_nan either an equivalent operator or x > inf might work. But I would
definitely want to catch the case where 'NaN' = 'NaN' because for most users I
presume if they have a query that evaluates to NaN on each side they do not
want those values to ever match each other. Technically letting infinities
match is also not quite correct, but it's part of the IEEE standard (as I
understand it) so the precedent for this is already out there.
I would also be ok with treating NaN as NULL and disallowing its presence in
index columns if that is the 'keep it simple stupid' solution. Should be
pretty rare that NaNs are intentionally used, but I could definitely see them
accidentally turning up and confusing clients when it doesn’t sort or compare
Thanks for reading. Let me know if you would prefer to chat instead if
anything isn't clear.
To view, visit http://gerrit.cloudera.org:8080/3142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Owner: Will Berkeley <wdberke...@gmail.com>
Gerrit-Reviewer: Dan Burkert <d...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Peter Ebert
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: Will Berkeley <wdberke...@gmail.com>