[ 
https://issues.apache.org/jira/browse/SPARK-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933792#comment-14933792
 ] 

Asoka Diggs edited comment on SPARK-10782 at 9/28/15 7:11 PM:
--------------------------------------------------------------

A reasonable-sounding request, but I'm not familiar with the acronym (PR), and 
this is my first time dipping my toe into reporting an issue.  I will try to be 
more specific, and may need a pointer to remedial education :)

EDIT: PR = Pull Request.  I found the documentation about Contributing to Spark 
and will puzzle my way through.


The change I propose is to the documentation example for drop_duplicates only 
(I tested locally that it works in my Spark 1.5.0 pyspark instance):
<OLD line>
df.dropDuplicates().show()

<NEW line>
df.drop_duplicates().show()


A larger philosophical question - based on the documentation, it appears that 
there are 3 implementations of the equivalent of SQL's DISTINCT clause:  
distinct(), dropDuplicates(), and drop_duplicates().  The latter two support a 
column list to work on, but are otherwise the same as distinct().  It seems 
that ideally, all three of these are really 1 implementation behind the scenes, 
with the other two listed as aliases.
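
To make the aliasing idea concrete, here is a minimal pure-Python sketch.  It 
is a toy model only, not Spark's source code: the ToyFrame class and its 
columns/rows attributes are hypothetical names invented for illustration.  
The method names mirror the three documented ones, and the sketch assumes the 
behaviour described above (one implementation, an optional column list, and 
distinct() as the no-argument case).

```python
# Toy stand-in for a DataFrame. ToyFrame is a hypothetical class for
# illustration; it is NOT how Spark implements these methods.
class ToyFrame:
    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]

    def dropDuplicates(self, subset=None):
        """Drop rows that repeat an earlier row on the `subset` columns.

        With subset=None every column is compared. This toy keeps the
        first occurrence; that choice is an assumption of the sketch.
        """
        cols = self.columns if subset is None else list(subset)
        idx = [self.columns.index(c) for c in cols]
        seen = set()
        kept = []
        for row in self.rows:
            key = tuple(row[i] for i in idx)
            if key not in seen:
                seen.add(key)
                kept.append(row)
        return ToyFrame(self.columns, kept)

    # The snake_case name is bound to the same function object,
    # so there is only one implementation to maintain and document.
    drop_duplicates = dropDuplicates

    def distinct(self):
        # distinct() is the "all columns" case of the same implementation.
        return self.dropDuplicates()


df = ToyFrame(["name", "age"], [("Alice", 5), ("Alice", 5), ("Alice", 10)])
all_columns = df.distinct().rows
by_name = df.drop_duplicates(["name"]).rows
```

In this sketch, documenting the three names as aliases would be accurate by 
construction, since drop_duplicates and dropDuplicates are the same function 
and distinct() delegates to it.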

Hopefully this amounts to a second documentation update (listing the 3 methods 
as aliases of each other).  In the worst case, it becomes a suggestion that 
the 3 implementations be merged into 1, with the documentation updated to 
indicate that they are aliases.



> Duplicate examples for drop_duplicates and DropDuplicates
> ---------------------------------------------------------
>
>                 Key: SPARK-10782
>                 URL: https://issues.apache.org/jira/browse/SPARK-10782
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.5.0
>            Reporter: Asoka Diggs
>            Priority: Trivial
>
> In the documentation for pyspark.sql, the source code examples for 
> dropDuplicates and drop_duplicates are identical.  It appears that the 
> example for dropDuplicates was copy/pasted for drop_duplicates and not edited.
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates



