[
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998177#comment-15998177
]
Rui Li commented on HIVE-15104:
-------------------------------
I looked at Spark's shuffle writers and none of them seem to need the
hashCode/partitionId after the HiveKey is serialized. But I ran into a problem
during implementation. The plan is to implement this Spark trait:
{code}
trait KryoRegistrator {
def registerClasses(kryo: Kryo): Unit
}
{code}
Then we set the implementing class as {{spark.kryo.registrator}}. At runtime,
Spark uses reflection to instantiate our class and calls its {{registerClasses}}
method to register the optimized SerDe for HiveKey.
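For illustration, a minimal sketch of what the registrator could look like (the class name {{HiveKryoRegistrator}} and the serializer body are hypothetical, not committed code; it assumes Kryo's {{Serializer}} API and {{HiveKey}}, which extends {{BytesWritable}}):
{code}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.hadoop.hive.ql.io.HiveKey;
import org.apache.spark.serializer.KryoRegistrator;

public class HiveKryoRegistrator implements KryoRegistrator {
  @Override
  public void registerClasses(Kryo kryo) {
    kryo.register(HiveKey.class, new Serializer<HiveKey>() {
      @Override
      public void write(Kryo kryo, Output output, HiveKey key) {
        // Serialize only the key bytes; drop the cached hashCode,
        // which is not needed after the shuffle write.
        output.writeInt(key.getLength(), true);
        output.writeBytes(key.getBytes(), 0, key.getLength());
      }

      @Override
      public HiveKey read(Kryo kryo, Input input, Class<HiveKey> type) {
        int len = input.readInt(true);
        HiveKey key = new HiveKey();
        key.set(input.readBytes(len), 0, len);
        return key;
      }
    });
  }
}
{code}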
However, Kryo is relocated (shaded) in Hive. After the build, the method
signature of our class actually becomes
{{public void registerClasses(org.apache.hive.com.esotericsoftware.kryo.Kryo
kryo)}}.
When Spark calls the method, we get an {{AbstractMethodError}}. I suppose this
is because the {{public void registerClasses(com.esotericsoftware.kryo.Kryo
kryo)}} method declared by the trait is never actually implemented.
Does anybody know how this can be resolved?
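The failure mode can be reproduced in miniature without Spark or Kryo on the classpath: once relocation rewrites the parameter type, the class simply has no overload matching the interface's slot, so the JVM has nothing to dispatch to. A self-contained sketch (the {{Kryo}}/{{ShadedKryo}} stand-in classes are hypothetical):
{code}
public class RelocationDemo {
  // Stand-ins: Kryo plays com.esotericsoftware.kryo.Kryo (Spark's view);
  // ShadedKryo plays org.apache.hive.com.esotericsoftware.kryo.Kryo
  // (what Hive's relocated bytecode compiles against).
  static class Kryo {}
  static class ShadedKryo {}

  // After relocation, the registrator effectively declares only
  // registerClasses(ShadedKryo). The registerClasses(Kryo) slot that
  // Spark dispatches through stays abstract, hence AbstractMethodError.
  static class HiveRegistrator {
    public void registerClasses(ShadedKryo kryo) { /* register serde */ }
  }

  static boolean hasOverload(Class<?> paramType) {
    try {
      HiveRegistrator.class.getMethod("registerClasses", paramType);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("registerClasses(Kryo): " + hasOverload(Kryo.class));
    System.out.println("registerClasses(ShadedKryo): " + hasOverload(ShadedKryo.class));
  }
}
{code}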
> Hive on Spark generate more shuffle data than hive on mr
> --------------------------------------------------------
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Affects Versions: 1.2.1
> Reporter: wangwenli
> Assignee: Rui Li
>
> The same SQL, running on the Spark and MR engines, generates different sizes
> of shuffle data.
> I think this is because Hive on MR serializes only part of the HiveKey, while
> Hive on Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?
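The size difference the report describes can be sketched with stand-in types: writing only the key bytes versus serializing the whole object with its extra field and class metadata. Here the {{Key}} class is hypothetical and Java serialization merely stands in for generic full-object serialization such as default Kryo:
{code}
import java.io.*;

public class ShuffleSizeDemo {
  // Stand-in for HiveKey: key bytes plus a cached hash code.
  static class Key implements Serializable {
    byte[] bytes;
    int hash;
    Key(byte[] b, int h) { bytes = b; hash = h; }
  }

  // MR-style: only a length prefix plus the raw key bytes.
  static int bytesOnlySize(Key k) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeInt(k.bytes.length);
    out.write(k.bytes);
    out.flush();
    return buf.size();
  }

  // Full-object style: every field plus class metadata, as a generic
  // object serializer would emit.
  static int fullObjectSize(Key k) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(buf);
    out.writeObject(k);
    out.flush();
    return buf.size();
  }

  public static void main(String[] args) throws IOException {
    Key k = new Key("row-key".getBytes("UTF-8"), 12345);
    System.out.println("bytes only:  " + bytesOnlySize(k));   // 4 + 7 = 11
    System.out.println("full object: " + fullObjectSize(k));  // noticeably larger
  }
}
{code}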
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)