GitHub user tmyklebu opened a pull request:

    https://github.com/apache/spark/pull/407

    [SPARK-1281] Improve partitioning in ALS

    ALS was mixing HashPartitioner with direct uses of `%`.  Further, the 
    bare use of `%` meant that, if the number of partitions coincided with the 
    stride of arithmetic progressions appearing in the user and product ids, 
    users and products could be mapped into buckets in an unbalanced way.
    
    This pull request:
    1) Makes the Partitioner an instance variable of ALS.
    2) Replaces the direct uses of `%` with calls to a Partitioner.
    3) Defines an anonymous Partitioner that scrambles the bits of the object's 
    hashCode before reducing modulo the number of buckets (a sketch of this idea 
    appears below).
    
    This pull request does not make the partitioner user-configurable.
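
    For illustration only, here is a rough Scala sketch of the kind of 
    bit-scrambling Partitioner described in (3).  It is not the patch's actual 
    code; the class name and the mixing constants are assumptions made for the 
    example:

        import org.apache.spark.Partitioner

        // Hypothetical partitioner that mixes the key's hashCode bits before
        // reducing modulo numPartitions, so that ids forming an arithmetic
        // progression do not pile into a handful of buckets.
        class ScrambledPartitioner(override val numPartitions: Int) extends Partitioner {
          override def getPartition(key: Any): Int = {
            var h = key.hashCode()
            h ^= (h >>> 16)      // illustrative mixing steps; the PR's actual
            h *= 0x85ebca6b      // scrambling may differ
            h ^= (h >>> 13)
            val mod = h % numPartitions
            if (mod < 0) mod + numPartitions else mod  // keep the bucket index non-negative
          }

          override def equals(other: Any): Boolean = other match {
            case p: ScrambledPartitioner => p.numPartitions == numPartitions
            case _ => false
          }

          override def hashCode: Int = numPartitions
        }

    With such a partitioner in scope, a bare expression like `userId % numBlocks` 
    becomes `partitioner.getPartition(userId)` (names here are illustrative, not 
    taken from the patch).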
    
    I'm not all that happy about the way I did (1).  It introduces an icky 
    lifetime issue and dances around it by nulling something.  However, I don't 
    know a better way to make the partitioner visible everywhere it needs to be 
    visible.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tmyklebu/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/407.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #407
    
----
commit c774d7d4bff91c9387d059d1189799fa0ff1f4b0
Author: Tor Myklebust <[email protected]>
Date:   2014-04-14T22:01:18Z

    Make the partitioner a member variable and use it instead of modding directly.

commit c90b6d8e91f86cf89adf28de6f9185647c87e5c8
Author: Tor Myklebust <[email protected]>
Date:   2014-04-14T22:10:30Z

    Scramble user and product ids before bucketing.

----


