[
https://issues.apache.org/jira/browse/SPARK-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471366#comment-15471366
]
yuhao yang commented on SPARK-17094:
------------------------------------
Thanks for the comment, Sean. The two questions were great.
1. For the configuration, it might be something like
{code}
pipeline("tokenizer").asInstanceOf[Tokenizer].set...
pipeline(2).asInstanceOf[Tokenizer].set...
{code}
It would be great if there were a way to avoid the cast.
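One possible way to hide the cast is a typed lookup driven by a ClassTag. The following is only a rough sketch in plain Scala: `Stage`, `SimplePipeline`, and the simplified `Tokenizer` and `HashingTF` are hypothetical stand-ins, not the real org.apache.spark.ml classes, and `stage[T]` is not an existing Pipeline method.

```scala
import scala.reflect.ClassTag

// Hypothetical stand-ins for Spark ML stages (not the real Spark classes).
trait Stage { def name: String }

final class Tokenizer extends Stage {
  val name = "tokenizer"
  var toLowercase: Boolean = true
  def setToLowercase(value: Boolean): this.type = { toLowercase = value; this }
}

final class HashingTF extends Stage {
  val name = "hashingTF"
  var numFeatures: Int = 1 << 18
  def setNumFeatures(value: Int): this.type = { numFeatures = value; this }
}

// Hypothetical simplified pipeline that keeps its stages in order.
final class SimplePipeline(val stages: Seq[Stage]) {
  // Return the first stage whose runtime type is T, so the caller
  // never has to write asInstanceOf.
  def stage[T <: Stage : ClassTag]: Option[T] =
    stages.collectFirst { case s: T => s }
}

object TypedStageDemo {
  def main(args: Array[String]): Unit = {
    val pipeline = new SimplePipeline(Seq(new Tokenizer, new HashingTF))
    // Configure the tokenizer without a cast at the call site.
    pipeline.stage[Tokenizer].foreach(_.setToLowercase(false))
    println(pipeline.stage[Tokenizer].map(_.toLowercase))
  }
}
```

The lookup returns an Option, so a pipeline without a Tokenizer yields None instead of a ClassCastException.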
Eventually, I think it would be great to have configuration support for ML
transformers, so that we can do:
{code}
sc.set("ml.tokenizer.toLowercase", "false")
{code}
as well as configuration file support, which would avoid hard-coding and make
tuning on a cluster easier. (Does anyone like the idea? cc [~josephkb] [~mengxr])
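To make the configuration idea a bit more concrete, here is a minimal sketch of how keys of the form "ml.<stage>.<param>" could be grouped per stage before being applied. It is plain Scala with no Spark dependency; `MlConfSketch` and `paramsByStage` are hypothetical names, and the key layout is only my assumption based on the example key in this comment.

```scala
object MlConfSketch {
  // Group keys shaped like "ml.<stage>.<param>" into stage -> (param -> value).
  // Keys without the "ml." prefix, or without a param part, are ignored.
  def paramsByStage(conf: Map[String, String]): Map[String, Map[String, String]] =
    conf.toSeq
      .flatMap { case (key, value) =>
        if (!key.startsWith("ml.")) None
        else key.stripPrefix("ml.").split("\\.", 2) match {
          case Array(stage, param) if param.nonEmpty => Some((stage, param, value))
          case _ => None
        }
      }
      .groupBy(_._1)
      .map { case (stage, entries) =>
        stage -> entries.map(e => e._2 -> e._3).toMap
      }

  def main(args: Array[String]): Unit = {
    val conf = Map(
      "ml.tokenizer.toLowercase" -> "false",
      "ml.hashingTF.numFeatures" -> "1024",
      "spark.app.name" -> "demo" // not an ml.* key, so it is skipped
    )
    println(paramsByStage(conf))
  }
}
```

Each stage could then look up its own entry and apply the values through its setters; how string values would be converted to typed params is left open here.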
2. I think most users would only use linear pipelines. Could you please
provide an example of a non-linear pipeline, so that we can have a specific
discussion?
I tried your code, but I cannot find a constructor for Pipeline like that. Is it
something under development? And do we need to set the input column and output
column for each stage?
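For the linear case, the input and output columns could in principle be derived automatically by feeding each stage's output into the next stage. The sketch below is plain Scala; `ColumnWiring`, `Wired`, and the generated column names are hypothetical and only illustrate the chaining, not any existing Spark behavior.

```scala
object ColumnWiring {
  // Hypothetical record of which columns a stage reads and writes.
  final case class Wired(stage: String, inputCol: String, outputCol: String)

  // In a linear pipeline each stage reads the previous stage's output column.
  // Output column names are generated from the stage name and its position.
  def wire(stages: Seq[String], firstInput: String): Seq[Wired] =
    stages.zipWithIndex
      .foldLeft((firstInput, Vector.empty[Wired])) {
        case ((in, acc), (stage, i)) =>
          val out = s"${stage}_${i}_out"
          (out, acc :+ Wired(stage, in, out))
      }
      ._2

  def main(args: Array[String]): Unit =
    wire(Seq("tokenizer", "hashingTF", "lda"), "text").foreach(println)
}
```

A non-linear pipeline would need explicit column names for at least the stages that branch or merge, which is why a concrete example would help the discussion.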
Overall, the feature would:
1. Allow people (especially beginners) to create an ML application in one simple
line of code.
2. Be handy for users, as they don't have to set the input and output
columns.
3. Thinking further, we may no longer need code to build a Spark ML
application, as it could be done by configuration:
{code}
"ml.pipeline": "tokenizer", "hashingTF", "lda"
"ml.tokenizer.toLowercase": "false"
...
{code}
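As a rough illustration of building a pipeline from such a configuration, a name-to-factory registry could turn the stage list into stage objects. This is a sketch in plain Scala under my own assumptions: `PipelineFromConf`, `registry`, `build`, and the `Named` placeholder stage are all hypothetical, and the stage list is assumed to be a comma-separated string.

```scala
object PipelineFromConf {
  // Hypothetical placeholder stage; a real version would create Spark ML stages.
  trait Stage { def name: String }
  final case class Named(name: String) extends Stage

  // Hypothetical registry from lowercase stage names to factories.
  val registry: Map[String, () => Stage] = Map(
    "tokenizer" -> (() => Named("tokenizer")),
    "hashingtf" -> (() => Named("hashingTF")),
    "lda" -> (() => Named("lda"))
  )

  // Parse a comma-separated stage list and instantiate the stages in order.
  // Unknown names are reported together instead of failing on the first one.
  def build(spec: String): Either[String, Seq[Stage]] = {
    val names = spec.split(",").map(_.trim).filter(_.nonEmpty).toSeq
    val unknown = names.filterNot(n => registry.contains(n.toLowerCase))
    if (unknown.nonEmpty) Left(s"unknown stages: ${unknown.mkString(", ")}")
    else Right(names.map(n => registry(n.toLowerCase)()))
  }

  def main(args: Array[String]): Unit = {
    println(build("tokenizer, hashingTF, lda"))
    println(build("tokenizer, word2vex"))
  }
}
```

Returning an Either keeps a typo in the configuration file from surfacing as an exception deep inside fit().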
> provide simplified API for ML pipeline
> --------------------------------------
>
> Key: SPARK-17094
> URL: https://issues.apache.org/jira/browse/SPARK-17094
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: yuhao yang
>
> Many machine learning pipelines have an API for easily assembling transformers.
> One example would be:
> val model = new Pipeline("tokenizer", "countvectorizer", "lda").fit(data).
> Appreciate feedback and suggestions.