Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49149396
  
    Let me close this PR for now. I will fork or wrap as necessary. Keep it in 
mind, and maybe in a 2.x release this can be revisited. (Matei, I ran into more 
problems with the `Rating` class retrofit anyway.)
    
    Yes, storage is the downside. Your comments on JIRA about serialization 
compressing away the difference are promising. I completely agree with using 
`Float` for ratings and even for feature vectors.
    
    Yes, I understand why random projections are helpful. They don't help 
accuracy, but may only trivially hurt it in return for some performance gain. 
If I have just one rating, arbitrarily adding your ratings to mine doesn't make 
my recommendations better. Sure, that's denser, and maybe there's less 
overfitting, but it's fitting the wrong input for both of us.
    
    A collision here and there is probably acceptable. One in a million 
customers? OK. One percent? Maybe a problem. I agree, you'd have to quantify 
this to decide. If I'm an end user of MLlib bringing even millions of items 
into my model, I have to make that call. And if collisions are a problem, I 
have to maintain a lookup table to work around them.
    
    It seemed simplest to moot the problem with a much bigger key space and 
engineer around the storage issue. A bit more memory is cheap; accuracy and 
engineering time are expensive.
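    To make "you'd have to quantify this" concrete, here is a rough 
birthday-problem estimate (my own sketch, not from the PR) of how many 
colliding pairs to expect when hashing a million IDs into a 32-bit versus a 
64-bit key space:

```python
# Back-of-envelope sketch: expected number of colliding ID pairs when
# hashing n IDs uniformly into a key space of size d, using the
# birthday approximation n(n-1)/(2d).

def expected_colliding_pairs(n: float, d: float) -> float:
    return n * (n - 1) / (2.0 * d)

million = 1e6        # a million customer IDs
int32 = 2.0 ** 32    # hashing into an Int key space
int64 = 2.0 ** 64    # the "much bigger" Long key space

# On the order of a hundred colliding pairs at 32 bits; effectively zero at 64.
print(f"32-bit keys: ~{expected_colliding_pairs(million, int32):.0f} colliding pairs")
print(f"64-bit keys: ~{expected_colliding_pairs(million, int64):.1e} colliding pairs")
```

    At a million IDs, a 32-bit hash already produces on the order of a hundred 
colliding pairs, while a 64-bit key space makes collisions vanishingly rare, 
which is the sense in which a bigger key space moots the problem.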

