[jira] [Commented] (SPARK-17094) provide simplified API for ML pipeline

Sean Owen (JIRA) Tue, 20 Sep 2016 07:10:09 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15506649#comment-15506649
 ]


Sean Owen commented on SPARK-17094:
-----------------------------------

Sure, consider a pipeline that needs to convert several subsets of columns to 
categorical variables and then reassemble them. Each subset is handled by a 
separate transformation of the source DataFrame, and the results are then 
combined with VectorAssembler. It's not the case that each stage uses the 
previous stage's output column as its input column. I don't think that's even 
common once there is any non-trivial ETL at the front of the pipeline.
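
A minimal sketch of that shape, assuming Spark 2.x and hypothetical column 
names ("country", "device", "age") picked only for illustration. Each 
StringIndexer reads a column of the source DataFrame rather than the previous 
stage's output, and VectorAssembler pulls the results together:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object NonLinearPipelineSketch {
  // Hypothetical column names, chosen only for illustration.
  val categoricalCols: Seq[String] = Seq("country", "device")
  val numericCols: Seq[String] = Seq("age")

  def buildPipeline(): Pipeline = {
    // One indexer per categorical column. Each reads a source column,
    // not the output column of the stage before it.
    val indexers: Seq[PipelineStage] = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
    }
    // The assembler names all of its input columns explicitly.
    val assembler = new VectorAssembler()
      .setInputCols((categoricalCols.map(c => s"${c}_idx") ++ numericCols).toArray)
      .setOutputCol("features")
    new Pipeline().setStages((indexers :+ assembler).toArray)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._
    // Toy data, made up for this example.
    val df = Seq(("US", "phone", 31.0), ("DE", "laptop", 45.0), ("US", "laptop", 27.0))
      .toDF("country", "device", "age")
    buildPipeline().fit(df).transform(df).select("features").show(false)
    spark.stop()
  }
}
```

A list of stage names alone could not express this wiring, since the 
assembler's inputs come from several earlier stages plus a raw column.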

Consider a pipeline that builds several models off one set of input.
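
As a rough sketch of that case, assuming Spark 2.x and a tiny made-up training 
set, two estimators can be fit against the same feature DataFrame. The object 
and method names here are hypothetical:

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel, NaiveBayes, NaiveBayesModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

object SharedInputSketch {
  // Toy (label, features) rows, made up for this example.
  def trainingData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      (0.0, Vectors.dense(0.0, 1.1)),
      (1.0, Vectors.dense(2.0, 1.0)),
      (0.0, Vectors.dense(0.1, 1.2)),
      (1.0, Vectors.dense(1.9, 0.8))
    )).toDF("label", "features")

  // Both models read the same "label" and "features" columns;
  // neither consumes the other's output.
  def fitBoth(training: DataFrame): (LogisticRegressionModel, NaiveBayesModel) = {
    val lrModel = new LogisticRegression().setMaxIter(10).fit(training)
    val nbModel = new NaiveBayes().fit(training)
    (lrModel, nbModel)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    val (lrModel, nbModel) = fitBoth(trainingData(spark))
    println(s"LR features: ${lrModel.numFeatures}, NB features: ${nbModel.numFeatures}")
    spark.stop()
  }
}
```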

The case where you have a truly linear pipeline (the output of one stage is 
always the input to the next) with no other configuration at all is rare, I 
think. It's also already about as easy with the current API.
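
For comparison, here is a sketch of the linear tokenizer, CountVectorizer, LDA 
chain from the issue description written against the existing API, assuming 
Spark 2.x. The column names and the choice of k = 2 are assumptions for the 
example:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
import org.apache.spark.sql.SparkSession

object LinearPipelineSketch {
  def buildPipeline(textCol: String): Pipeline = {
    // Each stage names its columns explicitly; here every input happens
    // to be the previous stage's output.
    val tokenizer = new Tokenizer().setInputCol(textCol).setOutputCol("words")
    val vectorizer = new CountVectorizer().setInputCol("words").setOutputCol("features")
    val lda = new LDA().setK(2).setFeaturesCol("features")
    new Pipeline().setStages(Array(tokenizer, vectorizer, lda))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._
    // Toy documents, made up for this example.
    val docs = Seq(
      "spark ml pipeline api",
      "lda topic model text",
      "pipeline stages and columns"
    ).toDF("text")
    val model = buildPipeline("text").fit(docs)
    model.transform(docs).show(false)
    spark.stop()
  }
}
```

The column wiring costs a few setter calls, which is the extra verbosity the 
proposed one-line form would remove for this linear case.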

> provide simplified API for ML pipeline
> --------------------------------------
>
>                 Key: SPARK-17094
>                 URL: https://issues.apache.org/jira/browse/SPARK-17094
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> {code}
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data)
> {code}
> Overall, the feature would:
> 1. Allow people (especially beginners) to create an ML application in one 
> simple line of code. 
> 2. Be handy for users, as they don't have to set the input and output 
> columns.
> 3. Thinking further, we may no longer need code to build a Spark ML 
> application, as it can be done by configuration:
> {code}
> "ml.pipeline.input": "hdfs://path.svm"
> "ml.pipeline": "tokenizer", "hashingTF", "lda"
> "ml.tokenizer.toLowercase": "false"
> ...
> {code}
> This can be quite efficient for tuning on a cluster.
> I'd appreciate feedback and suggestions.




