Github user asokadiggs commented on the pull request:

    https://github.com/apache/spark/pull/8647#issuecomment-144508693
  
    I've encountered the same problem @GayathriMurali has been looking at here, 
and separately reported and fixed it in [SPARK-10782].  There, the specific issue 
is that drop_duplicates doesn't exist as an independent method definition - 
around lines 1280-1285 of dataframe.py you'll find "drop_duplicates = 
dropDuplicates" in place of a method definition.  So there is no separate 
documentation string for drop_duplicates.
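
    To illustrate the point, here is a minimal sketch of that aliasing pattern. 
The Frame class is a hypothetical stand-in I made up for the example, not the 
real pyspark DataFrame; only the "drop_duplicates = dropDuplicates" line mirrors 
what is in dataframe.py.

```python
class Frame(object):
    """Hypothetical stand-in for a DataFrame; not the real pyspark class."""

    def __init__(self, rows):
        self.rows = list(rows)

    def dropDuplicates(self):
        """Return a new Frame with duplicate rows removed."""
        seen = []
        for row in self.rows:
            if row not in seen:
                seen.append(row)
        return Frame(seen)

    # The alias binds a second name to the very same function object,
    # so it cannot carry a docstring of its own.
    drop_duplicates = dropDuplicates


original = Frame.__dict__["dropDuplicates"]
alias = Frame.__dict__["drop_duplicates"]

same_object = alias is original          # one function, two names
alias_name = alias.__name__              # still reports the original name
alias_doc = alias.__doc__                # shared with dropDuplicates
deduped = Frame([1, 1, 2]).drop_duplicates().rows
```

Because both names point at one function object, any documentation tool that 
reads the docstring will see the dropDuplicates text under both names.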
    
    I "fixed" that problem by adding to the dropDuplicates documentation the 
note that drop_duplicates is an alias for dropDuplicates, with the presumption 
that the reader can substitute drop_duplicates in the code example for 
dropDuplicates on their own.
    
    That works for a few instances, but I'm thinking it's pointing to a larger 
issue that may warrant a different approach.
    
    In the case of dropDuplicates and drop_duplicates, there is also the method 
distinct.  All three do the same thing, but distinct is NOT implemented as an 
alias of dropDuplicates.  So we have 2 implementations for 3 methods, all 
intended to do 1 thing (remove duplicate rows from a dataframe).  That doesn't 
sound healthy to me.
    
    Another issue with the aliasing approach currently being used is that the 
"@since" decorator is attached to the single method definition.  So if we were 
to alias distinct to dropDuplicates for 1.6.0, the documentation would indicate 
that distinct has existed since 1.4.0 (rather than either 1.3.0, when it was 
actually created, or 1.6.0, when it would be aliased to dropDuplicates).
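
    A rough sketch of why the version note follows the alias. The since 
decorator below is a simplified stand-in written for this example (the real 
pyspark decorator may differ in detail), and Frame is again a hypothetical 
class. The assumption is only that the decorator writes the version into the 
docstring of the function it wraps.

```python
def since(version):
    """Simplified stand-in: append a version note to the docstring."""
    def deco(f):
        f.__doc__ = (f.__doc__ or "").rstrip() + "\n\n.. versionadded:: %s" % version
        return f
    return deco


class Frame(object):
    """Hypothetical stand-in for a DataFrame; not the real pyspark class."""

    @since("1.4.0")
    def dropDuplicates(self):
        """Return a new Frame with duplicate rows removed."""
        return self

    # Existing alias.
    drop_duplicates = dropDuplicates
    # Hypothetical: what aliasing distinct to dropDuplicates would look like.
    distinct = dropDuplicates


distinct_doc = Frame.__dict__["distinct"].__doc__
reports_1_4 = "versionadded:: 1.4.0" in distinct_doc
```

The note lives on the one shared function, so every alias reports the same 
version regardless of when that name was actually introduced.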
    
    
    End result - this is a bit of documentation and code structure that I'm 
interested in helping out with, but I lack the skills or background to carry 
it by myself.  @GayathriMurali - I am interested in assisting if you are still 
working on this.  I have some thoughts about larger design intent that can guide 
individual fixes for different methods - what is the best forum for documenting 
and discussing these?

