[ 
https://issues.apache.org/jira/browse/SPARK-10962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhijit Deb updated SPARK-10962:
--------------------------------
    Affects Version/s: 1.5.0
             Priority: Critical  (was: Major)
          Description: We are trying to find the duplicate rows in a DataFrame. We 
first get the unique rows and then try to get the duplicates using 
"except". Getting the unique rows is quite fast, but getting the duplicates 
using "except" is tremendously slow. What is the best way to get the 
duplicates? Getting just the unique rows is not sufficient in most use cases.
          Component/s: SQL
              Summary: DataFrame "except" method...  (was: DataFrame "except)

> DataFrame "except" method...
> ----------------------------
>
>                 Key: SPARK-10962
>                 URL: https://issues.apache.org/jira/browse/SPARK-10962
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Abhijit Deb
>            Priority: Critical
>
> We are trying to find the duplicate rows in a DataFrame. We first get the 
> unique rows and then try to get the duplicates using "except". Getting the 
> unique rows is quite fast, but getting the duplicates using "except" is 
> tremendously slow. What is the best way to get the duplicates? Getting just 
> the unique rows is not sufficient in most use cases.
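
One possible alternative to the "uniques, then except" approach described in the issue is to group identical rows and keep only the groups that occur more than once. The following is a minimal sketch of that idea in plain Python, not Spark code: rows are modelled as tuples, and the function name `find_duplicates` and the sample data are hypothetical. On a DataFrame the analogous step would presumably be a group-by over all columns followed by a count and a filter; whether that is faster on the reporter's data has not been measured here.

```python
from collections import Counter


def find_duplicates(rows):
    """Return a dict mapping each row that occurs more than once to its count.

    `rows` is an iterable of hashable records (tuples here). This is a
    hypothetical helper for illustration, not part of any Spark API.
    """
    # Single pass: group identical rows together and count each group.
    counts = Counter(rows)
    # Keep only the groups with more than one member, i.e. the duplicates.
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical sample data standing in for the DataFrame's rows.
rows = [
    ("alice", 30),
    ("bob", 25),
    ("alice", 30),
    ("carol", 41),
    ("bob", 25),
    ("alice", 30),
]

duplicates = find_duplicates(rows)
for row, n in sorted(duplicates.items()):
    print(row, n)
```

Unlike a set difference, this approach needs only one grouping pass and also reports how many times each duplicated row occurs.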



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
