[
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060300#comment-14060300
]
Xiangrui Meng commented on SPARK-2465:
--------------------------------------
[~sowen] The ALS implementation shuffles data in each iteration. I tested ALS
on the 100x Amazon Reviews dataset. Each iteration shuffles about 200GB of data
(see the attached screenshot). If we switch to Long, ALS will definitely slow
down. On the other hand, having a few hash collisions may not be a serious
problem. That is essentially random dimensionality reduction, and it also
densifies the data, which helps ALS. We can estimate how many users/products we
can handle if we allow a 0.1% collision rate (should be a couple million) and
discuss the trade-offs further.
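
A rough sketch of that estimate, assuming IDs are hashed uniformly at random
into 2^bits values and reading "0.1% collision" as the expected fraction of IDs
that share a hash value with some other ID. The function names are illustrative
only and are not part of MLlib.

```python
import math


def colliding_fraction(n, bits=32):
    """Expected fraction of n IDs that share a hash value with another ID.

    Assumes a uniform hash into 2**bits buckets (an idealization).
    """
    m = 2 ** bits
    # A given ID avoids all n - 1 others with probability (1 - 1/m)^(n - 1).
    return -math.expm1((n - 1) * math.log1p(-1.0 / m))


def max_ids_for_fraction(target, bits=32):
    """Largest n whose expected colliding fraction stays at or below target."""
    m = 2 ** bits
    # Invert 1 - (1 - 1/m)^(n - 1) = target for n.
    return int(1 + math.log1p(-target) / math.log1p(-1.0 / m))


def any_collision_probability(n, bits=32):
    """Birthday-bound approximation of P(at least one collision among n IDs)."""
    m = 2 ** bits
    return -math.expm1(-n * (n - 1) / (2.0 * m))


if __name__ == "__main__":
    print(max_ids_for_fraction(0.001, bits=32))
    print(any_collision_probability(100_000, bits=32))
    print(any_collision_probability(5 * 10**9, bits=64))
```

Under these assumptions, a 0.1% colliding fraction with 32-bit hashes
corresponds to about 4.3 million IDs, in line with the "couple million" guess.
The probability of seeing at least one collision is a different and much
stricter measure: it passes 50% well below a million IDs for 32 bits, and only
in the billions for 64 bits.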
> Use long as user / item ID for ALS
> ----------------------------------
>
> Key: SPARK-2465
> URL: https://issues.apache.org/jira/browse/SPARK-2465
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.0.1
> Reporter: Sean Owen
> Priority: Minor
> Attachments: Screen Shot 2014-07-13 at 8.49.40 PM.png
>
>
> I'd like to float this for consideration: use longs instead of ints for user
> and product IDs in the ALS implementation.
> The main reason is that identifiers are not generally numeric at all, and
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits
> means collisions are likely after hundreds of thousands of users and items,
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per ID
> in each Rating.
> Thoughts? I will post a PR so as to show what the change would be.