Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/1393#issuecomment-49149396
Let me close this PR for now. I will fork or wrap as necessary. Keep it in
mind, and maybe in a 2.x release this can be revisited. (Matei, I ran into more
problems with the `Rating` class retrofit anyway.)
Yes, storage is the downside. Your comments on JIRA about the effects of
serialization compressing away the difference are promising. I completely
agree with using `Float` for ratings and even feature vectors.
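
Just to make the storage trade-off concrete, a back-of-the-envelope sketch (not code from this PR; the class names are made up), ignoring object headers and serialization, which is exactly where the JIRA comments suggest the difference compresses away:

```scala
// Raw field sizes per rating record, before any serialization or compression.
case class RatingIntDouble(user: Int, product: Int, rating: Double)  // 4 + 4 + 8 = 16 bytes of fields
case class RatingLongFloat(user: Long, product: Long, rating: Float) // 8 + 8 + 4 = 20 bytes of fields
// Widening IDs to Long while narrowing the rating to Float costs roughly
// 4 extra bytes of raw field data per record.
```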
Yes, I understand why random projections are helpful. They don't help
accuracy, and at best only trivially hurt it in return for some performance gain.
If I have just one rating, it doesn't make my recs better to arbitrarily add your
ratings to mine. Sure, that's denser, and maybe you're getting less overfitting,
but it's fitting the wrong input for both of us.
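
To illustrate what a collision does (a toy Scala sketch, nothing from this PR; the data is made up):

```scala
// Two distinct Long user IDs that collide under java.lang.Long.hashCode
// ((int)(v ^ (v >>> 32))): both map to 1.
val userA = 1L
val userB = 1L << 32            // 4294967296L
assert(userA.hashCode == userB.hashCode)

// Hypothetical ratings keyed by the hashed (Int) ID: the two users' histories
// get merged, so the model fits the wrong input for both of them.
val ratings = Seq((userA, 101, 5.0f), (userB, 202, 1.0f))
val byHashedId = ratings.groupBy { case (user, _, _) => user.hashCode }
println(byHashedId)             // a single key (1) holding both users' ratings
```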
A collision here and there is probably acceptable. One in a million
customers? OK. 1%? Maybe a problem. I agree, you'd have to quantify this to
decide. If I'm an end user of MLlib bringing even millions of things to my
model, I have to make that call. And if it's a problem, I have to maintain a
lookup table to use it.
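
Quantifying it is simple enough with the usual birthday-problem estimate; for example (rough numbers of my own, purely illustrative):

```scala
// Expected number of colliding pairs when hashing n distinct IDs uniformly
// into m buckets is about n*(n-1)/(2*m) (birthday-problem approximation).
def expectedCollisions(n: Double, m: Double): Double = n * (n - 1) / (2 * m)

val n = 1e6                               // a million customers
val m = math.pow(2, 32)                   // 32-bit hashed key space
val collisions = expectedCollisions(n, m) // ~116 colliding pairs
val fractionAffected = 2 * collisions / n // ~0.02% of customers affected
println(f"$collisions%.0f collisions, $fractionAffected%.4f fraction affected")
```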
It seemed simplest to moot the problem with a much bigger key space and
engineer around the storage issue. A bit more memory is cheap; accuracy and
engineering time are expensive.