[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

Rui Li (JIRA) Thu, 12 Oct 2017 02:33:15 -0700

    [ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16201695#comment-16201695 ]


Rui Li commented on HIVE-15104:
-------------------------------

One correction: the {{NoClassDefFoundError}} is for 
{{com.esotericsoftware.kryo.Serializer}}. That's because our serializers for 
HiveKey and BytesWritable extend kryo's Serializer. When our classes are 
loaded, the super class must be loaded too, hence the error.

Since the serializers are static nested classes of HiveKryoRegistrator, I tried 
loading the class without linking it, i.e. by calling 
{{ClassLoader.loadClass()}}. That avoids the {{NoClassDefFoundError}}, but I'm 
not sure whether this is reliable and independent of the JVM implementation.
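
A minimal sketch of the difference between the two loading calls, not the code 
from the patch. {{LoadWithoutInit}} and {{Probe}} are hypothetical names, with 
{{Probe}} standing in for a static nested serializer class. Linking itself 
cannot be observed from plain Java code, so the sketch only shows the related, 
observable effect: {{ClassLoader.loadClass()}} does not initialize the class, 
while {{Class.forName()}} with {{initialize=true}} does.

```java
public class LoadWithoutInit {
    // Flipped by Probe's static initializer so initialization is observable.
    static boolean probeInitialized = false;

    // Hypothetical stand-in for a static nested serializer class.
    static class Probe {
        static {
            probeInitialized = true;
        }
    }

    // Returns {initialized after loadClass, initialized after forName}.
    static boolean[] demo() throws ClassNotFoundException {
        ClassLoader loader = LoadWithoutInit.class.getClassLoader();
        // Binary name of a nested class is Outer$Nested.
        String name = LoadWithoutInit.class.getName() + "$Probe";

        // loadClass(name) only requests loading; it does not run static initializers.
        loader.loadClass(name);
        boolean afterLoadClass = probeInitialized;

        // forName with initialize=true forces linking and initialization.
        Class.forName(name, true, loader);
        boolean afterForName = probeInitialized;

        return new boolean[] {afterLoadClass, afterForName};
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = demo();
        System.out.println("initialized after loadClass: " + result[0]);
        System.out.println("initialized after forName:   " + result[1]);
    }
}
```

On a fresh JVM this should print {{false}} for the first line and {{true}} for 
the second. Whether a JVM links a class eagerly after {{loadClass()}} is left 
to the implementation, which is the open question above.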

> Hive on Spark generate more shuffle data than hive on mr
> --------------------------------------------------------
>
>                 Key: HIVE-15104
>                 URL: https://issues.apache.org/jira/browse/HIVE-15104
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.2.1
>            Reporter: wangwenli
>            Assignee: Rui Li
>         Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, 
> HIVE-15104.3.patch, HIVE-15104.4.patch, HIVE-15104.5.patch, 
> HIVE-15104.5.patch, TPC-H 100G.xlsx
>
>
> The same SQL, run on the Spark and MR engines, generates different sizes 
> of shuffle data.
> I think this is because Hive on MR serializes only part of the HiveKey, 
> while Hive on Spark, which uses kryo, serializes the full HiveKey object.
> What is your opinion?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
