[ https://issues.apache.org/jira/browse/SPARK-14831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254447#comment-15254447 ]

Shivaram Venkataraman commented on SPARK-14831:
-----------------------------------------------

Yeah, I think there are a couple of factors to consider here:

1. Existing R users who want to use SparkR: for this case I think it's valuable 
to have the methods mimic the argument ordering used by the corresponding R 
functions. So we would then have kmeans(data, centers, ...) and glm(formula, 
family, data, ...). I think it's useful to mimic the ordering for two reasons: 
(a) it helps with familiarity, and (b) it ensures we can safely override the 
base R functions as we do now.
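For context, the base R (stats package) signatures being mimicked are, roughly 
(abbreviated to the leading arguments):

{code:none}
# Base R signatures, abbreviated:
kmeans(x, centers, iter.max = 10, ...)
glm(formula, family = gaussian, data, ...)
{code}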

2. New users of SparkR / Spark ML: I think internal consistency is useful for 
these users. My take on the SparkR API has always been that it doesn't hurt to 
support multiple ways of doing things as long as they don't collide. In this 
scenario, if we want to define a new set of consistent APIs, we should adopt a 
new namespace as [~mengxr] indicated. I would suggest `spark.kmeans` and 
`spark.glm` as opposed to `ml.glm` to make it clearer that these are SparkR 
functions (we already use `spark.lapply`, for example).
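To make the suggestion concrete, here is a sketch of what the prefixed, 
internally consistent API could look like (names and argument order are 
illustrative only, nothing here is decided):

{code:none}
# Illustrative only -- a possible spark.* namespace:
spark.glm(data, formula, family, ...)
spark.kmeans(data, formula, centers, ...)
{code}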

> Make ML APIs in SparkR consistent
> ---------------------------------
>
>                 Key: SPARK-14831
>                 URL: https://issues.apache.org/jira/browse/SPARK-14831
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SparkR
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> In current master, we have 4 ML methods in SparkR:
> {code:none}
> glm(formula, family, data, ...)
> kmeans(data, centers, ...)
> naiveBayes(formula, data, ...)
> survreg(formula, data, ...)
> {code}
> We tried to keep the signatures similar to the existing ones in R. However, 
> taken together they are not consistent. One example is k-means, which 
> doesn't accept a formula. Instead of looking at each method independently, we 
> might want to update the signature of kmeans to
> {code:none}
> kmeans(formula, data, centers, ...)
> {code}
> We can also discuss possible global changes here. For example, `glm` puts 
> `family` before `data` while `kmeans` puts `centers` after `data`, which is 
> not consistent. And logically, a formula doesn't mean anything without being 
> associated with a DataFrame. So it makes more sense to me to have the 
> following signature:
> {code:none}
> algorithm(df, formula, [required params], [optional params])
> {code}
> If we make this change, we might want to avoid name collisions because the 
> methods would have different signatures. We can use `ml.kmeans`, `ml.glm`, etc.
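> For illustration, applying the algorithm(df, formula, ...) convention to the 
> four current methods would give:
> {code:none}
> glm(df, formula, family, ...)
> kmeans(df, formula, centers, ...)
> naiveBayes(df, formula, ...)
> survreg(df, formula, ...)
> {code}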
> Sorry for discussing API changes at the last minute, but I think it would be 
> better to have consistent signatures in SparkR.
> cc: [~shivaram] [~josephkb] [~yanboliang]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
