[ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351478#comment-15351478
 ] 

Kai Jiang commented on SPARK-15767:
-----------------------------------

Here is detailed comparison:

✓ means MLlib matches R.
✗ means we should not include that parameter in the API.
? means I am not sure whether it should be included in the wrapper API.

So I propose we wrap the decision tree regressor and classifier separately.
These two methods could be implemented as `spark.decisionTreeRegressor` and
`spark.decisionTreeClassifier`. I looked into the code here
(https://github.com/apache/spark/blob/master/R/pkg/R/mllib.R#L117-L145):
Yanbo exposed R's `glm` API through `spark.glm`. Similarly, I think a good
way forward is to follow the MLlib API and implement
`spark.decisionTreeRegressor` and `spark.decisionTreeClassifier` first. Then
we could simply wrap these two APIs under the name `rpart` in mllib.R to make
it R-compliant.

The same approach should also work for the Random Forest and GBT APIs.
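As a rough sketch of how the proposed wrappers could look from the R side (the function names, parameter names, and the `rpart`-style alias below are assumptions about the proposal, not committed API):

```r
# Hypothetical SparkR usage, assuming the proposed wrapper names land as-is.
df <- createDataFrame(longley)

# Regression tree, following the spark.glm pattern; maxDepth/maxBins mirror
# the MLlib params marked ✓ above.
model <- spark.decisionTreeRegressor(df, Employed ~ GNP + Population,
                                     maxDepth = 5, maxBins = 32)
summary(model)
predictions <- predict(model, df)

# Later, a thin R-compliant alias could dispatch on `method`, rpart-style:
# method = "anova" -> regressor, method = "class" -> classifier.
model2 <- spark.rpart(df, Employed ~ GNP + Population, method = "anova")
```

The point of doing the two typed wrappers first is that each maps one-to-one onto an MLlib estimator; the `rpart`-style dispatcher is then a small layer on top.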

Comparison:

MLlib:
checkpointInterval: IntParam       ?
impurity: Param[String]            ✓
maxBins: IntParam                  ✓
maxDepth: IntParam                 ✓
minInfoGain: DoubleParam           ?
minInstancesPerNode: IntParam      ?
seed: LongParam                    ?

rpart:
rpart(formula, data, weights, subset, na.action = na.rpart, method, model = 
FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)

formula      ->  formula                                              ✓
data         ->  dataframe                                            ✓
weights      ->  (optional) case weights                              ✗
subset       ->  a subset of the rows of the data to use in the fit   ✗
na.action    ->  handling of missing values                           ✗
method       ->  one of "anova", "class"                              ✓
model, x, y  ->  whether to keep copies of the model frame/matrices   ✗
parms        ->  (optional) parameters for the splitting function.    ✓
                 "anova" splitting has no parameters. For "class"
                 splitting, the list can contain any of: the vector of prior
                 probabilities (component prior), the loss matrix (component
                 loss), or the splitting index (component split). The priors
                 must be positive and sum to 1. The loss matrix must have
                 zeros on the diagonal and positive off-diagonal elements.
                 The splitting index can be gini or information. The default
                 priors are proportional to the data counts, the losses
                 default to 1, and the split defaults to gini.

rpart.control(minsplit = 20, minbucket = round(minsplit/3), cp = 0.01, 
maxcompete = 4, maxsurrogate = 5, usesurrogate = 2, xval = 10, surrogatestyle = 
0, maxdepth = 30, ...)

control         ->  various parameters that control aspects of the rpart fit:
minsplit        ->  minimum number of observations that must exist in a node
                    in order for a split to be attempted.                  ?
minbucket       ->  minimum number of observations in any terminal <leaf>
                    node.                                                  ✗
cp              ->  complexity parameter.                                  ✗
maxcompete      ->  number of competitor splits retained in the output.    ?
maxsurrogate    ->  number of surrogate splits retained in the output.     ✗
usesurrogate    ->  how to use surrogates in the splitting process.        ✗
xval            ->  number of cross-validations.                           ✗
surrogatestyle  ->  controls the selection of a best surrogate.            ✗
maxdepth        ->  maximum depth of any node of the final tree.           ✓
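For reference, here is how the overlapping ✓ parameters look on the rpart side, using the standard rpart package and the bundled kyphosis dataset (a common example from the rpart docs; only `method`, `parms$split`, and `control$maxdepth` have direct MLlib counterparts in the table above):

```r
library(rpart)

# Classification tree: method = "class" corresponds to the proposed
# spark.decisionTreeClassifier; parms$split ("gini"/"information") maps to
# MLlib's impurity param, and maxdepth to MLlib's maxDepth.
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data    = kyphosis,
             method  = "class",
             parms   = list(split = "information"),
             control = rpart.control(maxdepth = 5))
printcp(fit)
```

Note that rpart caps `maxdepth` at 30, while MLlib's `maxDepth` has its own limit; the wrapper would need to document whichever bound MLlib enforces.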

> Decision Tree Regression wrapper in SparkR
> ------------------------------------------
>
>                 Key: SPARK-15767
>                 URL: https://issues.apache.org/jira/browse/SPARK-15767
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, SparkR
>            Reporter: Kai Jiang
>            Assignee: Kai Jiang
>
> Implement a wrapper in SparkR to support decision tree regression. R's naive 
> Decision Tree Regression implementation is from package rpart with signature 
> rpart(formula, dataframe, method="anova"). I propose we could implement an 
> API like spark.rpart(dataframe, formula, ...). After having implemented 
> decision tree classification, we could refactor these two into an API more 
> like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
