GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/7414

    [Spark-9062] [ML] Change output type of Tokenizer to Array(String, true)

    jira: https://issues.apache.org/jira/browse/SPARK-9062 
    
    Currently the output type of Tokenizer is Array(String, false), which is 
not compatible with Word2Vec and other transformers, since their input type is 
Array(String, true). A Seq[String] returned from a udf is treated as 
Array(String, true) by default.
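
    The mismatch can be sketched with Spark SQL's own types. This is a minimal 
illustration rather than the code in this patch: it assumes spark-sql is on the 
classpath, and the object name and the `accepts` helper are hypothetical, 
standing in for whatever strict schema check a downstream transformer performs.

```scala
import org.apache.spark.sql.types.{ArrayType, StringType}

object TokenizerOutputTypeSketch {
  // Written as Array(String, false) above: array elements may not be null.
  val nonNullable: ArrayType = ArrayType(StringType, containsNull = false)

  // Written as Array(String, true) above: array elements may be null.
  val nullable: ArrayType = ArrayType(StringType, containsNull = true)

  // Hypothetical strict check (an assumption, not a Spark API):
  // the column type must equal the expected type exactly.
  def accepts(expected: ArrayType, actual: ArrayType): Boolean =
    expected == actual

  def main(args: Array[String]): Unit = {
    // ArrayType is a case class, so containsNull takes part in equality.
    println(accepts(nullable, nonNullable)) // false

    // The single-argument factory defaults containsNull to true.
    println(ArrayType(StringType) == nullable) // true
  }
}
```

    Because the two types differ only in containsNull, an exact-equality check 
rejects the Tokenizer output even though the element type is the same.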
    
    I'm also wondering whether, for nullable columns, Tokenizer should return 
Array(null) for null values in the input. Any suggestions are welcome.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark tokenizer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7414.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7414
    
----
commit c01bd7a53da76992b924c6419067c2de90071e85
Author: Yuhao Yang <[email protected]>
Date:   2015-07-15T05:12:53Z

    change output type of tokenizer

----



