Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1393#issuecomment-49131064
  
    Besides breaking the API, I'm also worried about two things:
    
    1. The increase in storage. We had some discussion before v1.0 about 
whether we should switch to long ids or not. ALS is not computation-heavy for 
small k, but it is communication-heavy. I posted some screenshots on the JIRA 
page, where ALS shuffles ~200GB of data in each iteration. With Long ids, this 
number may grow to ~300GB, and hence ALS may slow down by 50%. Instead of 
upgrading the id type to Long, I'm actually thinking about downgrading the 
rating type to Float. (The first sketch after this list spells out the 
arithmetic.)
    
    2. Is collision really bad? ALS needs a somewhat "dense" matrix to compute 
good recommendations. If there are 3 billion users but each user only gives 1 
or 2 ratings, ALS is very likely to overfit. In that case, applying a random 
projection on the user side would certainly help, and hashing is one of the 
commonly used techniques for random projection. There will be bad 
recommendations whether or not hash collisions exist. So I'm really interested 
in some measurements of the downside of hash collisions. (The second sketch 
below shows what hashing the ids could look like.)
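    As a rough sketch of the arithmetic behind the ~50% figure (my own 
back-of-the-envelope numbers, not measurements from this PR): if a shuffled 
rating record is essentially (userId, productId, rating) with no per-record 
overhead, the payload sizes work out as follows.

```scala
// Raw payload per rating record under different type choices; serialization
// and object-header overheads are ignored in this sketch.
object RatingSizeEstimate {
  def main(args: Array[String]): Unit = {
    val intIdDouble  = 4 + 4 + 8 // (Int, Int, Double) = 16 bytes, current layout
    val longIdDouble = 8 + 8 + 8 // (Long, Long, Double) = 24 bytes
    val intIdFloat   = 4 + 4 + 4 // (Int, Int, Float) = 12 bytes

    // 24 / 16 = 1.5x: consistent with ~200GB growing to ~300GB per iteration.
    println(f"Long ids:      ${longIdDouble.toDouble / intIdDouble}%.2fx")
    // 12 / 16 = 0.75x: the saving from downgrading ratings to Float.
    println(f"Float ratings: ${intIdFloat.toDouble / intIdDouble}%.2fx")
  }
}
```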
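    And here is a minimal sketch of what hashing-as-random-projection could 
look like on the caller's side, assuming raw 64-bit ids (`RawRating` and 
`hashIds` are hypothetical names, not part of MLlib):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.Rating

// Hypothetical input record with 64-bit ids.
case class RawRating(userId: Long, productId: Long, rating: Double)

// Hash the 64-bit ids down to the Int space the current ALS API expects.
// Colliding ids get merged, which acts like a coarse random projection on
// that side of the matrix.
def hashIds(raw: RDD[RawRating]): RDD[Rating] = {
  // Simple 64->32 bit mix (same as java.lang.Long.hashCode); any
  // well-mixed hash would do for a sketch like this.
  def hash(id: Long): Int = (id ^ (id >>> 32)).toInt
  raw.map(r => Rating(hash(r.userId), hash(r.productId), r.rating))
}
```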

