[GitHub] spark pull request #18025: [WIP][SparkR] Update doc and examples for sql fun...

felixcheung Fri, 19 May 2017 21:53:55 -0700

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18025#discussion_r117602305
  
    --- Diff: R/pkg/R/generics.R ---
    @@ -396,67 +397,81 @@ setGeneric("agg", function (x, ...) { 
standardGeneric("agg") })
     #' @param object x a SparkDataFrame or a Column
     #' @param data new name to use
     #' @return a SparkDataFrame or a Column
    +#' @noRd
     NULL
     
     #' @rdname arrange
     #' @export
    +#' @noRd
     setGeneric("arrange", function(x, col, ...) { standardGeneric("arrange") })
     
     #' @rdname as.data.frame
     #' @export
    +#' @noRd
     setGeneric("as.data.frame",
                function(x, row.names = NULL, optional = FALSE, ...) {
                  standardGeneric("as.data.frame")
                })
     
     #' @rdname attach
     #' @export
    +#' @noRd
     setGeneric("attach")
     
     #' @rdname cache
     #' @export
    +#' @noRd
     setGeneric("cache", function(x) { standardGeneric("cache") })
     
     #' @rdname checkpoint
     #' @export
    +#' @noRd
     setGeneric("checkpoint", function(x, eager = TRUE) { 
standardGeneric("checkpoint") })
     
     #' @rdname coalesce
     #' @param x a Column or a SparkDataFrame.
    --- End diff --
    
    wait, I think some wires are crossed. let me clarify. let's take `coalesce` 
as an example.
    
    there are 2 coalesce, one `coalesce(df)` like repartition, and one 
`coalesce(df$foo)` on a column like in SQL. so therefore, these are in fact 2 
`coalesce` and `x is either a SparkDataFrame or a Column`.
    
    and to elaborate, the history behind that approach we have today is because 
we use to have this
    ```
    @param x a Column
    ...
    function(x = "Column"...)
    ```
    and at the same time, in a different .R file
    ```
    @param x a SparkDataFrame
    ...
    function(x = "SparkDataFrame"
    ```
    they seem fine when we write the code and it seems logical/easy to 
maintain, but when the rd/doc page is generated it has
    ```
    x a SparkDataFrame
    x a Column
    ```
    here `x` is explained twice. worse, the order is largely random (it's the 
alphabetic order of the .R file)
    
    and it is going against the standard R pattern of one description line for 
each parameter with the choice of type separated by `.. or..` like 
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html, and not to 
mention CRAN check I think complain about parameter documented more than once.
    
    so in short, having one `@param x a SparkDataFrame or a Column` is 
intentional. since this is describing 2 things, from discussions back then it 
feels more nature to put it some where independent - in fact like you are 
touching on, I'd argue it's better to look it up in generic.R rather than 
trying to figure out what other existing overload class that method already has.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #18025: [WIP][SparkR] Update doc and examples for sql fun...

Reply via email to