GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/10466
[SPARK-12375] [ML] add handleinvalid for vectorindexer
jira: https://issues.apache.org/jira/browse/SPARK-12375
> "Add option for allowing unknown categories, probably via a parameter like
> "allowUnknownCategories."
> If true, then handle unknown categories during transform by assigning them
> to an extra category index.
> The API should resemble the API used for StringIndexer."
The PR simply follows the current behavior of StringIndexer, which does not
yet support an extra category for unseen labels.
I propose extending `HasHandleInvalid` with more options, in this PR or
another, for example:
1. Adding an "allow" option that assigns all unseen labels to category -1.
(Making that value customizable would require introducing another parameter.)
2. Simply encouraging users to specify the category value for unseen labels
through `handleInvalid`, although some users may not expect this.
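To make the options concrete, here is a minimal plain-Python sketch of the
semantics under discussion. It is not Spark code and does not reflect the
patch itself: `index_category`, `transform`, and `UNSEEN_INDEX` are
hypothetical names chosen for illustration. The "error" and "skip" branches
are meant to mirror the options StringIndexer already offers, and "allow" is
option 1 above, assuming -1 as the fixed index for unseen values.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical sentinel for option 1 of the proposal; not part of any Spark API.
UNSEEN_INDEX = -1


def index_category(
    value: float,
    category_map: Dict[float, int],
    handle_invalid: str = "error",
) -> Optional[int]:
    """Map a raw feature value to its category index.

    Returns None when the containing row should be dropped ("skip").
    """
    if value in category_map:
        return category_map[value]
    if handle_invalid == "error":
        # Fail fast on a value that was not seen during fitting.
        raise ValueError(f"unseen category: {value}")
    if handle_invalid == "skip":
        # Signal the caller to filter this row out.
        return None
    if handle_invalid == "allow":
        # Proposed option: send every unseen value to one extra index.
        return UNSEEN_INDEX
    raise ValueError(f"unknown handle_invalid option: {handle_invalid}")


def transform(
    values: Iterable[float],
    category_map: Dict[float, int],
    handle_invalid: str = "error",
) -> List[int]:
    """Index a column of values, dropping rows marked for skipping."""
    indexed = (index_category(v, category_map, handle_invalid) for v in values)
    return [i for i in indexed if i is not None]
```

With a fitted map such as `{0.0: 0, 1.0: 1, 5.0: 2}`, the input
`[0.0, 9.0, 1.0]` raises under "error", loses the middle row under "skip",
and keeps all three rows under "allow", with the unseen value indexed as -1.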
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hhbyyh/spark handleinvalid
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10466.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10466
----
commit 6a0efede2b99a315895b1d3cccb9262ea845476c
Author: Yuhao Yang <[email protected]>
Date: 2015-12-24T02:43:48Z
add handleinvalid for vectorindexer
----