GitHub user MLnick opened a pull request:
https://github.com/apache/spark/pull/12896
[SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy
This PR adds a param to `ALS`/`ALSModel` that sets the strategy used when
`transform` encounters unknown users or items at prediction time. This can
occur in two scenarios: (a) production scoring, and (b) cross-validation and
evaluation.
The current behavior returns `NaN` if a user/item is unknown. In scenario
(b), this can easily occur when using `CrossValidator` or
`TrainValidationSplit` since some users/items may only occur in the test set
and not in the training set. In this case, the evaluator returns `NaN` for all
metrics, making model selection impossible.
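To illustrate why a single unknown user or item is enough to break model selection, here is a minimal pure-Python sketch (not Spark code). The `rmse` helper and the toy rows are hypothetical; the point is only that one `NaN` prediction propagates through the aggregate metric.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (label, prediction) pairs."""
    squared_errors = [(label - pred) ** 2 for label, pred in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy data: the last row stands in for a user/item that
# appeared only in the test split, so its prediction is NaN.
scored = [(4.0, 3.5), (3.0, 3.0), (5.0, float("nan"))]

metric = rmse(scored)
print(math.isnan(metric))  # the whole metric is NaN, not just one row
```

Because every candidate model then yields the same `NaN` metric, there is nothing left to compare between them.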
The new param, `unknownStrategy`, defaults to `nan` (the current behavior).
The other option supported initially is `drop`, which drops all rows with `NaN`
predictions. Setting it to `drop` allows users to use `ALS` in cross-validation
settings. The param is marked as an `expertParam`. It is a string so that the
set of strategies can be extended in the future (some options are discussed in
[SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489)).
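The semantics of the two strategies can be sketched in plain Python. This is an illustration under stated assumptions, not the implementation in this PR: `apply_unknown_strategy` is a hypothetical helper, and predictions are modeled as simple `(label, prediction)` tuples rather than DataFrame rows.

```python
import math

def apply_unknown_strategy(predictions, strategy="nan"):
    """Hypothetical stand-in for the post-prediction step described above.

    "nan"  keeps every row, including those with NaN predictions.
    "drop" removes rows whose prediction is NaN.
    """
    if strategy == "nan":
        return list(predictions)
    if strategy == "drop":
        return [(label, pred) for label, pred in predictions
                if not math.isnan(pred)]
    raise ValueError(f"unsupported strategy: {strategy!r}")

# Hypothetical toy data: the last row has an unknown user/item.
scored = [(4.0, 3.5), (3.0, 3.0), (5.0, float("nan"))]

kept_all = apply_unknown_strategy(scored, "nan")
kept_known = apply_unknown_strategy(scored, "drop")
print(len(kept_all), len(kept_known))
```

Modeling the strategy as a string rather than a boolean leaves room for further branches in the same dispatch, which mirrors the extensibility argument made for the param.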
## How was this patch tested?
New unit tests, and manual "before and after" tests for Scala & Python
using MovieLens `ml-latest-small` as example data. Here, using `CrossValidator`
or `TrainValidationSplit` with the default param setting results in metrics
that are all `NaN`, while setting `unknownStrategy` to `drop` results in valid
metrics.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MLnick/spark SPARK-14489-als-nan
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12896.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12896
----
commit 69a7ab31e91026bccc8f06eff9c98b452f09b5f2
Author: Nick Pentreath
Date: 2016-05-04T08:19:21Z
Scala param and tests
commit 82ccb2136a0052ac97b2560c850f221bd19073dd
Author: Nick Pentreath
Date: 2016-05-04T09:51:59Z
Python param and tests
commit 933526defd27256cfd3b59fac28276a50262d1bf
Author: Nick Pentreath
Date: 2016-05-04T10:16:40Z
Improve test a bit
commit 3287303f627a312b7fccf6e7d47151938df21966
Author: Nick Pentreath
Date: 2016-05-04T10:22:03Z
Doc tweak
commit fc437451a598221f0878b7a2e0b87d17572019cc
Author: Nick Pentreath
Date: 2016-05-04T10:23:33Z
Doc remove space
----