[
https://issues.apache.org/jira/browse/SPARK-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060127#comment-14060127
]
Sean Owen commented on SPARK-2465:
----------------------------------
https://github.com/apache/spark/pull/1393
> Use long as user / item ID for ALS
> ----------------------------------
>
> Key: SPARK-2465
> URL: https://issues.apache.org/jira/browse/SPARK-2465
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.0.1
> Reporter: Sean Owen
> Priority: Minor
>
> I'd like to float this for consideration: use longs instead of ints for user
> and product IDs in the ALS implementation.
> The main reason is that identifiers are not generally numeric at all, and
> will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits
> means collisions are likely after hundreds of thousands of users and items,
> which is not unrealistic. Hashing to 64 bits pushes this back to billions.
> It would also mean numeric IDs that happen to be larger than the largest int
> can be used directly as identifiers.
> On the downside of course: 8 bytes instead of 4 bytes of memory used per
> ID, so 8 extra bytes per Rating.
> Thoughts? I will post a PR so as to show what the change would be.
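
The collision estimates in the description can be checked with the standard birthday approximation. The following is a minimal sketch, assuming an ideal uniform hash; the function names are hypothetical and are not part of Spark or MLlib.

```python
import math


def collision_probability(n_ids: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_ids distinct keys collide
    when hashed uniformly to hash_bits bits (birthday approximation)."""
    space = 2.0 ** hash_bits
    # p ~= 1 - exp(-n(n-1) / (2N)); expm1 keeps precision when p is tiny.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


def ids_for_probability(p: float, hash_bits: int) -> float:
    """Approximate number of keys at which the collision chance reaches p."""
    space = 2.0 ** hash_bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    for bits in (32, 64):
        print(bits, "bits: even odds of a collision near",
              f"{ids_for_probability(0.5, bits):,.0f}", "IDs")
```

Under this approximation, a 32-bit hash reaches even odds of at least one collision at roughly 77,000 IDs, while a 64-bit hash reaches the same point at roughly 5 billion IDs, which is consistent with the "hundreds of thousands" versus "billions" framing above.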