[ https://issues.apache.org/jira/browse/SPARK-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164280#comment-14164280 ]
Josh Rosen commented on SPARK-3847:
-----------------------------------

Java arrays' hashCodes have a similar problem: they are based on the arrays' identities rather than their contents. SPARK-597 handled this by adding logic that throws an exception when attempting to partition or shuffle an RDD with Array types as keys while using the default HashPartitioner:
https://github.com/apache/spark/commit/c593f6329ee6f4f319810432c17b6d5703a3e0eb

Maybe we can do something similar here, too.

> Enum.hashCode is only consistent within the same JVM
> ----------------------------------------------------
>
>                 Key: SPARK-3847
>                 URL: https://issues.apache.org/jira/browse/SPARK-3847
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.1.0
>         Environment: Oracle JDK 7u51 64bit on Ubuntu 12.04
>            Reporter: Nathan Bijnens
>              Labels: enum
>
> When Java enums are used as keys in some operations, the results can be very
> unexpected. The issue is that Java's Enum.hashCode returns an identity-based
> hash (effectively the memory position), which is different on each JVM.
> {code}
> messages.filter(_.getHeader.getKind == Kind.EVENT).count
> >> 503650
> val tmp = messages.filter(_.getHeader.getKind == Kind.EVENT)
> tmp.map(_.getHeader.getKind).countByValue
> >> Map(EVENT -> 1389)
> {code}
> Because it's actually a JVM issue, we should either reject enums as keys
> with an error or implement a workaround.
> A good writeup of the issue (and a workaround) can be found here:
> http://dev.bizo.com/2014/02/beware-enums-in-spark.html
> Somewhat more on hash codes and enums:
> https://stackoverflow.com/questions/4885095/what-is-the-reason-behind-enum-hashcode
> And some issues (most of them rejected) in the Oracle Java Bug Database:
> - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8050217
> - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7190798
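
For illustration, here is a minimal sketch of the general idea behind a workaround, assuming the key can be swapped for a value whose hash depends on its contents rather than its identity: the enum constant's name, or Arrays.hashCode for arrays. The names StableKeys, Kind, stableHash and partitionFor are hypothetical and not part of Spark, and this is not necessarily what SPARK-597 or the linked writeup implements.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: derives hashes from contents
// instead of object identity, so they agree across JVMs.
final class StableKeys {

    // Stand-in for the reporter's Kind enum; the constants are assumed.
    enum Kind { EVENT, COMMAND }

    // String.hashCode is specified in terms of the characters, so a
    // name-based hash is the same in every JVM.
    static int stableHash(Enum<?> key) {
        return key.name().hashCode();
    }

    // Arrays.hashCode hashes the elements rather than the array identity.
    static int stableHash(int[] key) {
        return Arrays.hashCode(key);
    }

    // Non-negative modulo, a common way to turn a hash into a partition index.
    static int partitionFor(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int p = partitionFor(stableHash(Kind.EVENT), 8);
        System.out.println("EVENT -> partition " + p);
        // Two distinct arrays with equal contents hash identically.
        System.out.println(
            stableHash(new int[] {1, 2, 3}) == stableHash(new int[] {1, 2, 3})); // prints true
    }
}
```

In a job like the one quoted above, this would presumably amount to keying by something like getKind.name (or another content-based value) before any shuffle, so that every executor computes the same partition for the same constant.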