GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/9115#issuecomment-149048444
One corner case that I'm curious about: can the relative ordering of the
distinct and cast operations affect the answer? I'm thinking of two values
that are unequal but become equal after casting, such as two floats cast to
integers (which involves a loss of precision).
My understanding of our current casting semantics (please correct me if I'm
wrong!) is that our automatic/implicit casts will only serve to widen the type
without a loss of precision. As a result, it might be fine to push the cast
after the distinct if it's an implicit cast. But what about explicit casts? If
a user chose to cast a double to an int inside of their `sum`, e.g.
`sum(distinct int(foo))`, then would this optimization be unsafe?
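
To make the concern concrete, here is a minimal sketch in plain Python (not Spark code; the values for `foo` and both helper names are made up for illustration). It only models the two orderings with sets, under the assumption that an explicit double-to-int cast truncates the way Python's `int()` does:

```python
# Hypothetical values for a double column `foo`; 1.2 and 1.7 are distinct
# as doubles but collide once truncated to an int.
foo = [1.2, 1.7, 2.5]


def sum_cast_then_distinct(values):
    """Models sum(distinct int(foo)): cast first, then deduplicate."""
    return sum({int(v) for v in values})


def sum_distinct_then_cast(values):
    """Models the rewritten plan: deduplicate first, then cast."""
    return sum(int(v) for v in set(values))


# The narrowing cast gives different answers depending on the order.
narrowing_before = sum_cast_then_distinct(foo)   # {1, 2} -> 3
narrowing_after = sum_distinct_then_cast(foo)    # 1 + 1 + 2 -> 4

# A widening cast (int -> float here) keeps unequal values unequal,
# so the order does not matter for this input.
ints = [1, 1, 2]
widening_before = sum({float(v) for v in ints})
widening_after = sum(float(v) for v in set(ints))
```

With these inputs the narrowing case yields 3 for cast-then-distinct and 4 for distinct-then-cast, while both widening results are 3.0, which is the asymmetry between implicit and explicit casts that I'm asking about.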
/cc @yhuai, who originally filed this JIRA.