[
https://issues.apache.org/jira/browse/SPARK-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933792#comment-14933792
]
Asoka Diggs edited comment on SPARK-10782 at 9/28/15 7:11 PM:
--------------------------------------------------------------
A reasonable-sounding request, but I'm not familiar with the acronym (PR), and
this is my first time dipping my toe into reporting an issue. I will try to be
more specific, and may need a pointer to some remedial education :)
EDIT: PR = Pull Request. I found the documentation about Contributing to Spark
and will puzzle my way through.
The change I propose is to the documentation example for drop_duplicates only
(I tested the new line locally in pyspark on Spark 1.5.0 and it works):
<OLD line>
df.dropDuplicates().show()
<NEW line>
df.drop_duplicates().show()
A larger philosophical question: based on the documentation, there appear to
be three implementations of the equivalent of SQL's DISTINCT clause:
distinct(), dropDuplicates(), and drop_duplicates(). The latter two accept an
optional list of columns to consider, but otherwise behave the same as
distinct(). Ideally, all three would share one implementation behind the
scenes, with the other two documented as aliases.
Hopefully this is just a second update to the documentation (listing the three
methods as aliases of each other). In the worst case, it becomes a suggestion
that the three implementations be merged into one, with the documentation
updated to say they are aliases.
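To make the alias idea concrete, here is a minimal sketch in plain Python. It
is not Spark's actual code: the MiniFrame class and its list-of-dicts row
storage are hypothetical stand-ins, used only to show one implementation
exposed under three names.

```python
class MiniFrame:
    """Hypothetical stand-in for a DataFrame: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def dropDuplicates(self, subset=None):
        """Keep the first row seen for each distinct key.

        The key is the whole row, or only the columns named in `subset`.
        """
        seen = set()
        kept = []
        for row in self.rows:
            cols = sorted(row) if subset is None else subset
            key = tuple((c, row[c]) for c in cols)
            if key not in seen:
                seen.add(key)
                kept.append(row)
        return MiniFrame(kept)

    # Alias: the same function object bound to a second name.
    drop_duplicates = dropDuplicates

    def distinct(self):
        """Whole-row de-duplication, delegating to the single implementation."""
        return self.dropDuplicates()


# Sample data: two identical rows plus one that differs only in age.
df = MiniFrame([
    {"name": "Alice", "age": 5},
    {"name": "Alice", "age": 5},
    {"name": "Alice", "age": 10},
])
```

With this layout, df.distinct(), df.dropDuplicates(), and
df.drop_duplicates() all keep the same two rows of the sample data, while
passing ["name"] to either of the latter two keeps only one row.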
> Duplicate examples for drop_duplicates and DropDuplicates
> ---------------------------------------------------------
>
> Key: SPARK-10782
> URL: https://issues.apache.org/jira/browse/SPARK-10782
> Project: Spark
> Issue Type: Documentation
> Components: Documentation
> Affects Versions: 1.5.0
> Reporter: Asoka Diggs
> Priority: Trivial
>
> In the documentation for pyspark.sql, the source code examples for
> dropDuplicates and drop_duplicates are identical to each other. It appears
> that the example for dropDuplicates was copy/pasted for drop_duplicates and
> not edited.
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates