GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/7414
[Spark-9062] [ML] Change output type of Tokenizer to Array(String, true)
jira: https://issues.apache.org/jira/browse/SPARK-9062
Currently the output type of Tokenizer is Array(String, false), which is not
compatible with Word2Vec and other transformers, since their input type is
Array(String, true). A udf that returns Seq[String] is treated as
Array(String, true) by default.
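To make the mismatch concrete, here is a minimal sketch using only the
public org.apache.spark.sql.types API. The object and value names are
illustrative and are not taken from the patch.

```scala
import org.apache.spark.sql.types.{ArrayType, StringType}

// Illustrative names only; this is not code from the pull request.
object TokenizerOutputTypeSketch {
  // Output type declared by Tokenizer before this change.
  val nonNullableTokens: ArrayType = ArrayType(StringType, containsNull = false)

  // Input type expected by Word2Vec, per the description above.
  val nullableTokens: ArrayType = ArrayType(StringType, containsNull = true)

  def main(args: Array[String]): Unit = {
    // ArrayType is a case class, so equality also compares containsNull.
    println(nonNullableTokens == nullableTokens)     // false
    // The one-argument factory defaults containsNull to true.
    println(ArrayType(StringType) == nullableTokens) // true
  }
}
```

Any schema check that compares data types by plain equality would
therefore reject the old Tokenizer output column.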
I'm also wondering whether, for nullable columns, Tokenizer should return
Array(null) when the input value is null. Any suggestions are welcome.
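A rough sketch of what that null handling could look like in plain Scala
(no Spark dependency). The function name is hypothetical, and the
lowercase-and-split-on-whitespace step is an assumption about the
tokenization rule, not a quote from the patch.

```scala
// Hypothetical helper, for discussion only; not part of this pull request.
object NullTolerantTokenizeSketch {
  // Assumed tokenization rule: lowercase the input, then split on whitespace.
  def tokenize(input: String): Seq[String] =
    Option(input)                            // None when input is null
      .map(_.toLowerCase.split("\\s").toSeq)
      .getOrElse(Seq(null: String))          // null input yields Array(null)

  def main(args: Array[String]): Unit = {
    println(tokenize("Hello Spark ML"))
    println(tokenize(null))
  }
}
```

Returning a one-element sequence containing null only makes sense if the
output type is Array(String, true), which is what this change sets.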
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hhbyyh/spark tokenizer
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7414.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7414
----
commit c01bd7a53da76992b924c6419067c2de90071e85
Author: Yuhao Yang <[email protected]>
Date: 2015-07-15T05:12:53Z
change output type of tokenizer
----