[
https://issues.apache.org/jira/browse/SPARK-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313996#comment-14313996
]
Augustin Borsu edited comment on SPARK-5566 at 2/11/15 9:58 AM:
----------------------------------------------------------------
https://github.com/apache/spark/pull/4504
I propose a tokenizer loosely based on NLTK's RegexpTokenizer.
I didn't create a standalone tokenizer in mllib wrapped by ml, because I
don't think a standalone tokenizer is necessarily needed in mllib, but if
people disagree I can change that.
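As a rough illustration of the idea (not the code in the PR itself), a regex tokenizer in the spirit of NLTK's RegexpTokenizer either splits on a gap pattern or matches tokens directly; the `pattern` and `gaps` parameter names below are assumptions for the sketch:

```scala
// Minimal sketch of regex-based tokenization, loosely following
// NLTK's RegexpTokenizer. Parameter names are illustrative only.
object RegexTokenizerSketch {
  // If gaps = true, `pattern` describes the separators between tokens;
  // if gaps = false, `pattern` describes the tokens themselves.
  def tokenize(text: String,
               pattern: String = "\\W+",
               gaps: Boolean = true): Seq[String] =
    if (gaps) text.split(pattern).toSeq.filter(_.nonEmpty)
    else pattern.r.findAllIn(text).toSeq
}
```

For example, `RegexTokenizerSketch.tokenize("Hello, Spark!")` yields `Seq("Hello", "Spark")`.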
was (Author: augustinb):
We could use a tokenizer like this, but we would need to add regex and
Array[String] parameter types to be able to vary those parameters during
cross-validation.
https://github.com/apache/spark/pull/4504
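To make the cross-validation point concrete, here is a simplified, hypothetical sketch (not Spark's Param API) of what string-regex and Array[String] parameters buy you: a grid search over tokenizer settings is just the cartesian product of candidate values. All names (`Param`, `TokenizerParams`, `grid`, `stopWords`) are invented for illustration:

```scala
// Hypothetical, simplified parameter types showing why a tokenizer
// would want tunable String (regex) and Array[String] parameters.
object ParamGridSketch {
  case class Param[T](name: String, value: T)

  case class TokenizerParams(
    pattern: Param[String] = Param("pattern", "\\W+"),
    stopWords: Param[Array[String]] = Param("stopWords", Array.empty)
  )

  // A cross-validation grid is the cartesian product of candidate values.
  def grid(patterns: Seq[String],
           stopLists: Seq[Array[String]]): Seq[TokenizerParams] =
    for (p <- patterns; s <- stopLists)
      yield TokenizerParams(Param("pattern", p), Param("stopWords", s))
}
```

With two candidate regexes and one stop-word list, the grid has two parameter combinations for the cross-validator to try.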
> Tokenizer for mllib package
> ---------------------------
>
> Key: SPARK-5566
> URL: https://issues.apache.org/jira/browse/SPARK-5566
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> There exist tokenizer classes in the spark.ml.feature package and in the
> LDAExample in the spark.examples.mllib package. The Tokenizer in the
> LDAExample is more advanced and should be made into a full-fledged public
> class in spark.mllib.feature. The spark.ml.feature.Tokenizer class should
> become a wrapper around the new Tokenizer.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)