GitHub user mateiz opened a pull request:

    https://github.com/apache/spark/pull/986

    SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys

    When it finishes reading the keys with the current hash code, the current
implementation also reads one key with the next hash code, which may cause it
to miss some matches for that next key. As a result, operations like join can
give the wrong result when reduce tasks spill to disk and there are hash
collisions, because values for the same key are not matched together. This PR
fixes the problem by using a peeking iterator, so that the next key is not read
in early.
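
The idea of peeking instead of consuming can be illustrated with a minimal
sketch. This is not the code in the patch: the actual change is in Spark's
Scala sources, and the names PeekingIterator, read_same_hash and toy_hash
below are hypothetical, chosen only for this Python illustration. It assumes
a spilled stream of (key, value) pairs already sorted by the key's hash code.

```python
class PeekingIterator:
    """Wraps an iterator so the next element can be inspected without consuming it."""

    _EMPTY = object()  # sentinel: no element is currently buffered

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._head = self._EMPTY

    def has_next(self):
        # Buffer one element if nothing is buffered yet.
        if self._head is self._EMPTY:
            self._head = next(self._it, self._EMPTY)
        return self._head is not self._EMPTY

    def peek(self):
        # Look at the next element but leave it in the stream.
        if not self.has_next():
            raise StopIteration
        return self._head

    def next(self):
        # Consume and return the next element.
        value = self.peek()
        self._head = self._EMPTY
        return value


def read_same_hash(stream, hash_of):
    """Read every (key, value) pair whose key shares the hash code of the
    stream's next key, leaving the first pair with a different hash unread."""
    batch = []
    if stream.has_next():
        current = hash_of(stream.peek()[0])
        while stream.has_next() and hash_of(stream.peek()[0]) == current:
            batch.append(stream.next())
    return batch


# Toy hash that collides on purpose (an assumption for the example only).
toy_hash = len

spill = PeekingIterator([("a", 1), ("b", 2), ("cc", 3), ("dd", 4)])
first = read_same_hash(spill, toy_hash)   # [("a", 1), ("b", 2)]
second = read_same_hash(spill, toy_hash)  # [("cc", 3), ("dd", 4)]
```

Because the loop only peeks at the pair that ends a batch, that pair stays in
the stream and is returned together with the other pairs of its own hash code
on the following call.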

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mateiz/spark spark-2043

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/986.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #986
    
----
commit 892debbbc9f00c3e76eecf8bcc9bc91734c57b21
Author: Matei Zaharia <[email protected]>
Date:   2014-06-05T22:01:55Z

    SPARK-2043: don't read a key with the next hash code in
    ExternalAppendOnlyMap, instead use a buffered iterator to only read
    values with the current hash code.

----


