GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/1393

    SPARK-2465. Use long as user / item ID for ALS

    I'd like to float this for consideration: use longs instead of ints for 
user and product IDs in the ALS implementation.
    
    The main reason is that identifiers are not generally numeric at all, 
and will be hashed to an integer. (This is a separate issue.) Hashing to 32 
bits means collisions are likely after hundreds of thousands of users and 
items, which is not unrealistic. Hashing to 64 bits pushes this back to 
billions.
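
As a rough check on those numbers, here is a minimal sketch of the standard birthday-problem approximation. It assumes the hash spreads IDs uniformly over the 32-bit or 64-bit range; the function names are illustrative only and are not part of the patch.

```python
import math


def collision_probability(n_ids: int, bits: int) -> float:
    """Approximate chance that at least two of n_ids distinct keys
    share a hash value, assuming a uniform hash onto 2**bits values."""
    buckets = 2.0 ** bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2d)).
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * buckets))


def ids_for_probability(p: float, bits: int) -> float:
    """Approximate number of distinct keys at which the collision
    probability reaches p, under the same uniform-hash assumption."""
    buckets = 2.0 ** bits
    return math.sqrt(2.0 * buckets * math.log(1.0 / (1.0 - p)))


if __name__ == "__main__":
    for bits in (32, 64):
        half = ids_for_probability(0.5, bits)
        print(f"{bits}-bit hash: 50% collision chance near {half:,.0f} IDs")
    print(f"100,000 IDs, 32 bits: p = {collision_probability(100_000, 32):.3f}")
```

Under this approximation a 32-bit hash reaches even odds of at least one collision at somewhere under a hundred thousand IDs, while a 64-bit hash gets there at roughly five billion.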
    
    It would also mean numeric IDs that happen to be larger than the largest 
int can be used directly as identifiers.
    
    On the downside, of course: each ID takes 8 bytes of memory instead of 4, 
for both the user and the product ID in every Rating.
    
    Thoughts?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-2465

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1393.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1393
    
----
commit d4082ad7aa7468b63605d141f6d68e278983678a
Author: Sean Owen <[email protected]>
Date:   2014-07-13T14:57:15Z

    Use long instead of int for user/product IDs

----


