GitHub user MLnick commented on the issue:
https://github.com/apache/spark/pull/12896
I think there is a fair bit of difference between cross-validating the
model and scoring in production.
In most practical live-scoring situations, there may be multiple levels of
fallbacks / defaults for the cold-start case (e.g. "most popular", "newest",
content-based methods, and so on). There may also be various post-processing
steps applied to the results. I don't think it's feasible to re-create live
behaviour perfectly for cross-validation scenarios, especially as these
systems are often totally different from Spark.
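To make the post-processing idea concrete, here's a rough sketch (not an API proposal) of one such fallback: filling cold-start `NaN` predictions with the item's mean observed rating, defaulting to the global mean. All names here (`ratings`, `predictions`, column names) are illustrative:

```scala
import org.apache.spark.sql.functions.{avg, coalesce, col, isnan, lit, when}

// `ratings` = (user, item, rating); `predictions` = ALS output with a
// possibly-NaN `prediction` column. Replace NaN with the item's mean
// rating, falling back to the global mean for items unseen in training.
val itemMeans  = ratings.groupBy("item").agg(avg("rating").as("itemMean"))
val globalMean = ratings.agg(avg("rating")).first().getDouble(0)

val withFallback = predictions
  .join(itemMeans, Seq("item"), "left")
  .withColumn("prediction",
    when(isnan(col("prediction")), coalesce(col("itemMean"), lit(globalMean)))
      .otherwise(col("prediction")))
  .drop("itemMean")
```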
Even for offline bulk scoring, there may be many different options for cold
start. Do we intend to support all of them within Spark? I don't think that's
feasible either, though as discussed on the JIRA we can certainly support a few
useful options, such as "average user", which could indeed serve both CV and
live-scoring purposes.
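For reference, an "average user" option could look roughly like the sketch below, working against `ALSModel`'s `userFactors` / `itemFactors` DataFrames of `(id, features)` rows. It collects factors to the driver for brevity; a real implementation would aggregate distributedly:

```scala
// Average all learned user factor vectors into one "average user" vector,
// then score items by dot product with their item factors.
// `model` is assumed to be a fitted ALSModel; this is a sketch only.
val userVecs = model.userFactors
  .select("features")
  .collect()
  .map(_.getSeq[Float](0).toArray)

val rank    = userVecs.head.length
val avgUser = Array.tabulate(rank)(i => userVecs.map(_(i).toDouble).sum / userVecs.length)

// Predicted score of the "average user" for each item.
val avgUserScores = model.itemFactors.collect().map { row =>
  val itemId = row.getInt(0)
  val feats  = row.getSeq[Float](1)
  itemId -> feats.iterator.zipWithIndex.map { case (f, i) => f * avgUser(i) }.sum
}
```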
I actually think `NaN` for live scoring is "better" than, say, `0`, because
it makes very clear that this is a missing data point (which the consuming
system can choose how to handle) rather than a genuine prediction of `0`.
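For example, a consumer can separate the two cases explicitly, which a `0` sentinel wouldn't allow:

```scala
import org.apache.spark.sql.functions.{col, isnan}

// NaN is unambiguously "no prediction available"; a 0.0 sentinel would be
// indistinguishable from a genuinely low predicted rating.
val coldStart = predictions.filter(isnan(col("prediction")))   // route to a fallback
val scored    = predictions.filter(!isnan(col("prediction")))  // use as-is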
For CV, I'd expect that predicting `0` would dramatically skew RMSE, since
every cold-start row would contribute a large squared error regardless of
model quality. So for CV I'd say the `drop` option is more reasonable.
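Roughly, what I mean by `drop` for evaluation (names illustrative, a sketch rather than the proposed implementation):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.functions.{col, isnan}

// Drop cold-start rows before computing RMSE, so NaN predictions neither
// poison the metric nor get penalised as if the model had predicted 0.
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")

val rmse = evaluator.evaluate(predictions.filter(!isnan(col("prediction"))))
```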
This is not arguing against other reasonable options (average rating,
average user vectors and so on) - we can add those later based on user demand.
This is just a start.