GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/3795

    [SPARK-3847] Raise exception when hashing Java enums

    This patch modifies Spark to throw exceptions when attempting to 
hash-partition Java Enums.  Java Enums' hashCodes are machine/JVM-dependent, so 
it is unsafe to compare enum hashCodes generated in different JVMs; this means 
that we can't partition RDDs with enum keys.
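
To make the problem concrete, here is a minimal sketch in plain Java (no Spark dependency). The class `EnumHashDemo`, the enum `Color`, and the helpers `stableHash` and `partitionFor` are hypothetical names chosen for illustration; they are not part of this patch. `Enum.hashCode()` is final and delegates to `Object.hashCode()`, so its value is identity-based and can differ from one JVM to another, whereas the hash of the constant's name is fixed by the `String.hashCode()` specification.

```java
public class EnumHashDemo {
    // Hypothetical enum used only for this illustration.
    enum Color { RED, GREEN, BLUE }

    // Hash derived from the constant's name. String.hashCode() is defined
    // by the Java API specification, so every JVM computes the same value.
    static int stableHash(Enum<?> e) {
        return e.name().hashCode();
    }

    // Map a hash to a partition index in [0, numPartitions).
    static int partitionFor(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        Color key = Color.RED;
        // Identity-based: this number may change between JVM runs, so two
        // executors could send the same key to different partitions.
        System.out.println("Enum.hashCode():   " + key.hashCode());
        // Name-based: the same number on every run and every machine.
        System.out.println("name().hashCode(): " + stableHash(key));
        System.out.println("unsafe partition:  " + partitionFor(key.hashCode(), 4));
        System.out.println("stable partition:  " + partitionFor(stableHash(key), 4));
    }
}
```

Running this twice in separate JVMs will typically print different values on the first and third lines, while the second and fourth lines stay the same.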
    
    The fix here is based on a similar fix which prevents Java arrays from 
being used as keys when repartitioning 
(https://github.com/mesos/spark/pull/348): at RDD definition time, use 
reflection to check whether the key class is an enumeration and throw an 
exception if a HashPartitioner is being used.  We do not throw an exception if 
some other partitioner is used because, in principle, that partitioner could 
have custom logic for properly handling enums (e.g. by calling `toString()` and 
using the string's hashcode).  There are some corner-cases that this will miss 
(such as enums that are nested in other objects, like pairs of enums), but I 
think that this may be the best that we can do without adding a per-record 
performance overhead or making changes to the shuffle code.
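
The definition-time check described above might look roughly like the following sketch. It is written in plain Java with stand-in types and is not the code from this patch: `PartitionerChecks`, the local `Partitioner` interface, `HashPartitioner`, `EnumNamePartitioner`, and `checkKeyClass` are all hypothetical names introduced for illustration, and the exception type and message are assumptions. The check runs once per RDD definition and inspects only the key class, which is why it adds no per-record cost and why it cannot see enums nested inside other key types.

```java
public class PartitionerChecks {
    // Stand-in for a partitioner abstraction (assumption, not Spark's API).
    interface Partitioner {
        int getPartition(Object key);
    }

    // Partitions by key.hashCode(); unsafe for enum keys across JVMs.
    static final class HashPartitioner implements Partitioner {
        private final int numPartitions;
        HashPartitioner(int numPartitions) { this.numPartitions = numPartitions; }
        public int getPartition(Object key) {
            return Math.floorMod(key.hashCode(), numPartitions);
        }
    }

    // Custom partitioner that hashes an enum's name instead of its identity.
    static final class EnumNamePartitioner implements Partitioner {
        private final int numPartitions;
        EnumNamePartitioner(int numPartitions) { this.numPartitions = numPartitions; }
        public int getPartition(Object key) {
            Object stable = (key instanceof Enum) ? ((Enum<?>) key).name() : key;
            return Math.floorMod(stable.hashCode(), numPartitions);
        }
    }

    // Run once when the dataset is defined. Only hash partitioning is
    // rejected, because another partitioner may handle enums correctly.
    static void checkKeyClass(Class<?> keyClass, Partitioner partitioner) {
        if (!(partitioner instanceof HashPartitioner)) {
            return;
        }
        // isAssignableFrom also catches enum constants that declare a body,
        // whose runtime class is a subclass for which isEnum() is false.
        if (Enum.class.isAssignableFrom(keyClass)) {
            throw new IllegalArgumentException(
                "Cannot hash-partition Java enum keys (see SPARK-3847): " + keyClass.getName());
        }
        if (keyClass.isArray()) {
            throw new IllegalArgumentException(
                "Cannot hash-partition array keys: " + keyClass.getName());
        }
    }
}
```

With these stand-ins, `checkKeyClass(java.util.concurrent.TimeUnit.class, new HashPartitioner(4))` throws, while the same call with `new EnumNamePartitioner(4)` passes. A key type such as a pair of enums would also pass, which is the corner case noted above.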
    
    In case we have to add similar error checks in the future, I've factored 
the logic into a helper function in the `Partitioner` object.  I also improved 
the error messages to reference the relevant JIRA issues.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-3847

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3795.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3795
    
----
commit 4ae4efca4d5f1dbe5207c71702cc46c943b23b56
Author: Josh Rosen <[email protected]>
Date:   2014-12-25T02:50:21Z

    [SPARK-3847] Raise exception when hashing Java enums

commit c41483d035ebf3b5b20acfb244c67b8bbea24412
Author: Josh Rosen <[email protected]>
Date:   2014-12-25T05:30:13Z

    Minor documentation fixes

----


