GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/12896

    [SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy

    This PR adds a param to `ALS`/`ALSModel` that sets the strategy used when
    `transform` encounters unknown users or items at prediction time. This can
    occur in two scenarios: (a) production scoring, and (b) cross-validation and
    evaluation.
    
    The current behavior returns a `NaN` prediction if a user/item is unknown.
    In scenario (b), this can easily occur when using `CrossValidator` or
    `TrainValidationSplit`, since some users/items may occur only in the test
    set and not in the training set. In this case, the evaluator returns `NaN`
    for all metrics, making model selection impossible.
    
    The new param, `unknownStrategy`, defaults to `nan` (the current behavior).
    The only other option supported initially is `drop`, which drops all rows
    with `NaN` predictions. Setting it to `drop` allows users to use `ALS` in
    cross-validation settings. The param is marked as an `expertParam`, and it
    is a string so that the set of strategies can be extended in future (some
    options are discussed in
    [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489)).
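
To make the difference between the two strategies concrete, here is a minimal pure-Python sketch of the semantics described above. It is not the Spark implementation and does not use any Spark API: the functions `predict` and `rmse`, the argument name `unknown_strategy`, and the toy factor dictionaries are all hypothetical names chosen for illustration.

```python
import math


def predict(user_factors, item_factors, pairs, unknown_strategy="nan"):
    """Score (user, item) pairs from latent factors (illustrative only).

    A pair whose user or item has no factors gets a NaN score; with
    unknown_strategy="drop" those rows are removed from the result.
    """
    if unknown_strategy not in ("nan", "drop"):
        raise ValueError("unsupported strategy: %r" % unknown_strategy)
    rows = []
    for user, item in pairs:
        u = user_factors.get(user)
        v = item_factors.get(item)
        if u is None or v is None:
            score = float("nan")  # unknown user or item
        else:
            score = sum(a * b for a, b in zip(u, v))  # dot product
        rows.append((user, item, score))
    if unknown_strategy == "drop":
        rows = [row for row in rows if not math.isnan(row[2])]
    return rows


def rmse(rows, ratings):
    """Root mean squared error; a single NaN score makes the result NaN."""
    errors = [(score - ratings[(user, item)]) ** 2 for user, item, score in rows]
    return math.sqrt(sum(errors) / len(errors))


# Toy data: user 3 appears only in the "test" pairs, never in training.
user_factors = {1: [1.0, 0.0], 2: [0.0, 1.0]}
item_factors = {10: [2.0, 0.0]}
test_pairs = [(1, 10), (2, 10), (3, 10)]
ratings = {(1, 10): 2.0, (2, 10): 1.0, (3, 10): 4.0}

print(rmse(predict(user_factors, item_factors, test_pairs, "nan"), ratings))
# prints: nan
print(round(rmse(predict(user_factors, item_factors, test_pairs, "drop"), ratings), 4))
# prints: 0.7071
```

With `"nan"`, the single unknown user is enough to turn the whole metric into `NaN`; with `"drop"`, the metric is computed over the remaining rows only, which is why it becomes usable for model selection.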
    
    ## How was this patch tested?
    
    New unit tests, and manual "before and after" tests for Scala and Python
    using MovieLens `ml-latest-small` as example data. In the manual tests,
    using `CrossValidator` or `TrainValidationSplit` with the default param
    setting results in metrics that are all `NaN`, while setting
    `unknownStrategy` to `drop` results in valid metrics.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark SPARK-14489-als-nan

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12896.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12896
    
----
commit 69a7ab31e91026bccc8f06eff9c98b452f09b5f2
Author: Nick Pentreath
Date:   2016-05-04T08:19:21Z

    Scala param and tests

commit 82ccb2136a0052ac97b2560c850f221bd19073dd
Author: Nick Pentreath
Date:   2016-05-04T09:51:59Z

    Python param and tests

commit 933526defd27256cfd3b59fac28276a50262d1bf
Author: Nick Pentreath
Date:   2016-05-04T10:16:40Z

    Improve test a bit

commit 3287303f627a312b7fccf6e7d47151938df21966
Author: Nick Pentreath
Date:   2016-05-04T10:22:03Z

    Doc tweak

commit fc437451a598221f0878b7a2e0b87d17572019cc
Author: Nick Pentreath
Date:   2016-05-04T10:23:33Z

    Doc remove space

----

