[GitHub] spark pull request: [SPARK-11027][SQL] Better group distinct colum...

JoshRosen Sun, 18 Oct 2015 14:24:07 -0700

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/9115#issuecomment-149048444
  
    One corner case that I'm curious about: can the relative ordering of the 
distinct and cast operations affect the answer? I'm thinking of a case where 
two values are unequal but become equal after casting, such as casting two 
floats to integers (which loses precision).
    
    My understanding of our current casting semantics (please correct me if I'm 
wrong!) is that our automatic/implicit casts will only serve to widen the type 
without a loss of precision. As a result, it might be fine to push the cast 
after the distinct if it's an implicit cast. But what about explicit casts? If 
a user chose to cast a double to an int inside of their `sum`, e.g. 
`sum(distinct int(foo))`, then would this optimization be unsafe?
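
    To make the concern concrete, here is a minimal sketch in plain Python (not Spark code). The column values are hypothetical, and Python's `int()` and `set()` only stand in for a narrowing cast and for `distinct`; this illustrates the ordering question and makes no claim about how the optimizer actually rewrites the plan.

```python
# Hypothetical values for a double column `foo`; not taken from the PR.
foo = [1.2, 1.7, 2.5]

# sum(distinct int(foo)): cast first, then de-duplicate, then sum.
cast_then_distinct = sum({int(x) for x in foo})

# Cast pushed after the distinct: de-duplicate the doubles, then cast and sum.
distinct_then_cast = sum(int(x) for x in set(foo))

print(cast_then_distinct)  # 3: the distinct ints are {1, 2}
print(distinct_then_cast)  # 4: 1.2 and 1.7 are distinct doubles, so 1 + 1 + 2

# A widening cast (small ints to floats) maps distinct inputs to distinct
# outputs, so here the order of cast and distinct does not change the sum.
ints = [1, 1, 2, 3]
widen_then_distinct = sum({float(x) for x in ints})
distinct_then_widen = sum(float(x) for x in set(ints))
print(widen_then_distinct == distinct_then_widen)  # True
```

    In this sketch the two orderings disagree only when the cast can map two unequal inputs to the same output, which is the explicit narrowing case asked about above.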
    
    /cc @yhuai, who originally filed this JIRA.


