Sean Owen created SPARK-2465:
--------------------------------

             Summary: Use long as user / item ID for ALS
                 Key: SPARK-2465
                 URL: https://issues.apache.org/jira/browse/SPARK-2465
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.0.1
            Reporter: Sean Owen
            Priority: Minor


I'd like to float this for consideration: use longs instead of ints for user 
and product IDs in the ALS implementation.

The main reason is that identifiers are not generally numeric at all, and 
will be hashed to an integer. (This is a separate issue.) Hashing to 32 bits 
means collisions are likely after hundreds of thousands of users and items, 
which is not unrealistic. Hashing to 64 bits pushes this back to billions.
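
As a rough check on those figures, here is a small sketch using the standard
birthday-problem approximation. It assumes the hash spreads IDs uniformly over
the 32-bit or 64-bit space; the function names are made up for illustration
and are not part of MLlib.

```python
import math


def collision_probability(n_ids: int, bits: int) -> float:
    """Approximate chance that at least two of n_ids hash to the same value.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2^bits)),
    which assumes a uniformly distributed hash.
    """
    space = 2.0 ** bits
    # -expm1(-x) computes 1 - exp(-x) accurately even when x is tiny.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


def ids_for_probability(p: float, bits: int) -> float:
    """Approximate number of IDs at which the collision chance reaches p."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    for bits in (32, 64):
        n_half = ids_for_probability(0.5, bits)
        print(f"{bits}-bit hash: 50% collision chance near {n_half:,.0f} IDs")
```

Under that approximation a 32-bit hash reaches even odds of at least one
collision at roughly 77,000 IDs, and a 64-bit hash at roughly 5 billion.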

It would also mean numeric IDs that happen to be larger than the largest int 
can be used directly as identifiers.
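
To make both cases concrete, here is a minimal sketch of how a caller might
produce 64-bit IDs. It is an illustration only: hash_id_64 and the range
helpers are hypothetical names, and truncating SHA-256 to 8 bytes is just one
possible choice of hash, not something ALS prescribes.

```python
import hashlib

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def hash_id_64(identifier: str) -> int:
    """Map an arbitrary string ID to a signed 64-bit integer (hypothetical helper)."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    # Keep the first 8 bytes and interpret them as a signed 64-bit value.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def fits_int32(value: int) -> bool:
    return INT32_MIN <= value <= INT32_MAX


def fits_int64(value: int) -> bool:
    return INT64_MIN <= value <= INT64_MAX


if __name__ == "__main__":
    numeric_id = 3_000_000_000  # larger than the largest 32-bit signed int
    print(fits_int32(numeric_id), fits_int64(numeric_id))
    print(fits_int64(hash_id_64("user-42")))
```

A numeric ID such as 3,000,000,000 does not fit in a 32-bit signed int but can
be used unchanged as a 64-bit value.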

On the downside of course: 8 bytes instead of 4 bytes of memory used per ID, 
so 16 bytes instead of 8 for the two IDs in each Rating.
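
For a sense of scale, this sketch compares the raw packed size of one rating
record with int versus long IDs. It assumes a record of two IDs plus one
double, and it counts only the field payload; JVM object headers, padding and
collection overhead are not modelled.

```python
import struct

# "<" selects standard sizes with no padding: i = 4 bytes, q = 8, d = 8.
INT_ID_RECORD = struct.calcsize("<iid")   # int user, int product, double rating
LONG_ID_RECORD = struct.calcsize("<qqd")  # long user, long product, double rating

if __name__ == "__main__":
    print(f"int IDs:  {INT_ID_RECORD} bytes per record")
    print(f"long IDs: {LONG_ID_RECORD} bytes per record")
    print(f"extra:    {LONG_ID_RECORD - INT_ID_RECORD} bytes per record")
```

On that basis the raw payload grows from 16 to 24 bytes per record.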

Thoughts? I will post a PR so as to show what the change would be.



--
This message was sent by Atlassian JIRA
(v6.2#6252)