GitHub user asokadiggs commented on the pull request:
https://github.com/apache/spark/pull/8647#issuecomment-144508693
I've encountered the same problem @GayathriMurali has been looking at here,
and separately reported and fixed it in [SPARK-10782]. There, the specific
issue is that drop_duplicates doesn't exist as an independent method
definition - around lines 1280-1285 of dataframe.py you'll find
"drop_duplicates = dropDuplicates" in place of a method definition. So there
is no separate documentation string for drop_duplicates.
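To make the aliasing concrete, here is a minimal sketch in plain Python 3. The
Frame class and its contents are hypothetical stand-ins, not the actual
dataframe.py code; only the "drop_duplicates = dropDuplicates" line mirrors
what is described above.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame class; not PySpark code."""

    def __init__(self, rows):
        self.rows = list(rows)

    def dropDuplicates(self):
        """Return a new Frame with duplicate rows removed."""
        seen = []
        for row in self.rows:
            if row not in seen:
                seen.append(row)
        return Frame(seen)

    # Class-level alias: both names are bound to the same function object,
    # so the alias has no docstring or name of its own.
    drop_duplicates = dropDuplicates


same_object = Frame.drop_duplicates is Frame.dropDuplicates
shared_doc = Frame.drop_duplicates.__doc__
reported_name = Frame.drop_duplicates.__name__
```

Because the alias is just a second name for one function, help() and generated
documentation for drop_duplicates can only show what was written for
dropDuplicates.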
I "fixed" that problem by adding a note to the dropDuplicates documentation
that drop_duplicates is an alias for dropDuplicates, with the presumption
that the reader can substitute drop_duplicates for dropDuplicates in the code
example on their own.
That works for a few instances, but I think it points to a larger issue that
may warrant a different approach.
In the case of dropDuplicates and drop_duplicates, there is also the method
distinct. All three do the same thing, but distinct is NOT implemented as an
alias of dropDuplicates. So we have 2 implementations for 3 methods, all
intended to do 1 thing (remove duplicate rows from a dataframe). That doesn't
sound healthy to me.
Another issue with the aliasing approach currently being used is that the
"@since" decorator is attached to the single method definition. So if we were
to alias distinct to dropDuplicates for 1.6.0, the documentation would indicate
distinct has existed since 1.4.0 (rather than either 1.3.0, when it was
actually created, or 1.6.0, when it was aliased to dropDuplicates).
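The version problem can be sketched the same way. The since decorator below is
a simplified, hypothetical version written for illustration (it only appends a
version note to the docstring), and Frame is again a stand-in class; the
version numbers follow the scenario described above.

```python
def since(version):
    """Hypothetical, simplified decorator: append a version note to a docstring."""
    def decorate(func):
        func.__doc__ = (func.__doc__ or "").rstrip() + "\n\n.. versionadded:: " + version
        return func
    return decorate


class Frame:
    """Hypothetical stand-in for a DataFrame class; not PySpark code."""

    @since("1.4")
    def dropDuplicates(self):
        """Return a new Frame with duplicate rows removed."""
        return self

    # Aliasing copies nothing: distinct is the same function object, so it
    # carries the version note that was attached to dropDuplicates.
    distinct = dropDuplicates


distinct_doc = Frame.distinct.__doc__
```

In this sketch distinct can only report the version attached to
dropDuplicates; there is nowhere to record a different version for the alias.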
End result - this is a bit of documentation and code structure that I'm
interested in helping out with, but I lack the skills or background to carry
it out myself. @GayathriMurali - I am interested in assisting if you are still
working on this. I have some thoughts about larger design intent that can
guide individual fixes for different methods - what is the best forum for
documenting and discussing these?