[
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998177#comment-15998177
]
Rui Li commented on HIVE-15104:
-------------------------------
I looked at Spark's shuffle writers and none of them seem to need the
hashCode/partitionId after the HiveKey is serialized. But I ran into a problem
during implementation. The plan is to implement this Spark trait:
{code}
trait KryoRegistrator {
def registerClasses(kryo: Kryo): Unit
}
{code}
Then we set the implementing class as {{spark.kryo.registrator}}. At runtime,
Spark uses reflection to instantiate our class and calls its {{registerClasses}}
method to register the optimized SerDe for HiveKey.
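For illustration, a minimal sketch of what the registrator could look like (the class name {{HiveKryoRegistrator}} and the serializer body are hypothetical, not committed code; it assumes Kryo's {{Serializer}} API and {{HiveKey}}, which extends {{BytesWritable}}):
{code}
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.hadoop.hive.ql.io.HiveKey;
import org.apache.spark.serializer.KryoRegistrator;

public class HiveKryoRegistrator implements KryoRegistrator {
  @Override
  public void registerClasses(Kryo kryo) {
    kryo.register(HiveKey.class, new Serializer<HiveKey>() {
      @Override
      public void write(Kryo kryo, Output output, HiveKey key) {
        // Serialize only the key bytes; drop the cached hashCode,
        // which is not needed after the shuffle write.
        output.writeInt(key.getLength(), true);
        output.writeBytes(key.getBytes(), 0, key.getLength());
      }

      @Override
      public HiveKey read(Kryo kryo, Input input, Class<HiveKey> type) {
        int len = input.readInt(true);
        HiveKey key = new HiveKey();
        key.set(input.readBytes(len), 0, len);
        return key;
      }
    });
  }
}
{code}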
However, Kryo is relocated (shaded) in Hive. After the build, the method
signature of our class actually becomes
{{public void registerClasses(org.apache.hive.com.esotericsoftware.kryo.Kryo
kryo)}}.
When Spark calls the method, we get an {{AbstractMethodError}}. I suppose this
is because the {{public void registerClasses(com.esotericsoftware.kryo.Kryo
kryo)}} method declared by the trait is never actually implemented.
Does anybody know how this can be resolved?
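The failure mode can be reproduced in miniature without Spark or Kryo on the classpath: once relocation rewrites the parameter type, the class simply has no overload matching the interface's slot, so the JVM has nothing to dispatch to. A self-contained sketch (the {{Kryo}}/{{ShadedKryo}} stand-in classes are hypothetical):
{code}
public class RelocationDemo {
  // Stand-ins: Kryo plays com.esotericsoftware.kryo.Kryo (Spark's view);
  // ShadedKryo plays org.apache.hive.com.esotericsoftware.kryo.Kryo
  // (what Hive's relocated bytecode compiles against).
  static class Kryo {}
  static class ShadedKryo {}

  // After relocation, the registrator effectively declares only
  // registerClasses(ShadedKryo). The registerClasses(Kryo) slot that
  // Spark dispatches through stays abstract, hence AbstractMethodError.
  static class HiveRegistrator {
    public void registerClasses(ShadedKryo kryo) { /* register serde */ }
  }

  static boolean hasOverload(Class<?> paramType) {
    try {
      HiveRegistrator.class.getMethod("registerClasses", paramType);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("registerClasses(Kryo): " + hasOverload(Kryo.class));
    System.out.println("registerClasses(ShadedKryo): " + hasOverload(ShadedKryo.class));
  }
}
{code}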
> Hive on Spark generate more shuffle data than hive on mr
> --------------------------------------------------------
>
> Key: HIVE-15104
> URL: https://issues.apache.org/jira/browse/HIVE-15104
> Project: Hive
> Issue Type: Bug
> Components: Spark
> Affects Versions: 1.2.1
> Reporter: wangwenli
> Assignee: Rui Li
>
> The same SQL, running on the Spark and MR engines, generates different sizes
> of shuffle data.
> I think this is because Hive on MR serializes only part of the HiveKey, while
> Hive on Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?
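The size difference the report describes can be sketched with stand-in types: writing only the key bytes versus serializing the whole object with its extra field and class metadata. Here the {{Key}} class is hypothetical and Java serialization merely stands in for generic full-object serialization such as default Kryo:
{code}
import java.io.*;

public class ShuffleSizeDemo {
  // Stand-in for HiveKey: key bytes plus a cached hash code.
  static class Key implements Serializable {
    byte[] bytes;
    int hash;
    Key(byte[] b, int h) { bytes = b; hash = h; }
  }

  // MR-style: only a length prefix plus the raw key bytes.
  static int bytesOnlySize(Key k) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeInt(k.bytes.length);
    out.write(k.bytes);
    out.flush();
    return buf.size();
  }

  // Full-object style: every field plus class metadata, as a generic
  // object serializer would emit.
  static int fullObjectSize(Key k) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(buf);
    out.writeObject(k);
    out.flush();
    return buf.size();
  }

  public static void main(String[] args) throws IOException {
    Key k = new Key("row-key".getBytes("UTF-8"), 12345);
    System.out.println("bytes only:  " + bytesOnlySize(k));   // 4 + 7 = 11
    System.out.println("full object: " + fullObjectSize(k));  // noticeably larger
  }
}
{code}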
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)