[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

Rui Li (JIRA) Wed, 30 Aug 2017 19:04:17 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148340#comment-16148340
 ]


Rui Li commented on HIVE-15104:
-------------------------------

[~xuefuz], my previous 
[comment|https://issues.apache.org/jira/browse/HIVE-15104?focusedCommentId=15998177&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15998177]
 has some explanation of the relocation problem. Basically, the problem is 
that we need to implement a method defined by Spark, and that method accepts a 
kryo parameter. With relocation, Hive's kryo and Spark's kryo are in different 
packages. If we compile the class in Hive and run it in Spark, Spark will find 
the method not implemented because it has a different signature.
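
To make the signature mismatch concrete, here is a minimal sketch. All class 
names in it are hypothetical stand-ins, not real Hive or Spark classes: two 
nested `Kryo` classes play the role of the same class before and after 
relocation, and `registerClasses` is only an assumed name for the method Spark 
expects. The reflective lookup imitates Spark searching for a method that takes 
its own Kryo type.

```java
import java.lang.reflect.Method;

// Hypothetical stand-ins only; none of these are real Hive or Spark classes.
class RelocationDemo {
    // Plays the role of the unrelocated Kryo class that Spark sees.
    static class SparkSide { static class Kryo {} }

    // Plays the role of the same class after Hive's build relocates it.
    static class HiveShaded { static class Kryo {} }

    // Compiled on the Hive side, so its parameter type is the relocated Kryo.
    static class HiveRegistrator {
        public void registerClasses(HiveShaded.Kryo kryo) {
            // registration logic would go here
        }
    }

    // Returns true if target has a public registerClasses(paramType) method.
    static boolean hasRegisterClasses(Class<?> target, Class<?> paramType) {
        try {
            Method m = target.getMethod("registerClasses", paramType);
            return m != null;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Lookup with the type Spark would use: no matching method.
        System.out.println(
            hasRegisterClasses(HiveRegistrator.class, SparkSide.Kryo.class));
        // Lookup with the relocated type: the method is found.
        System.out.println(
            hasRegisterClasses(HiveRegistrator.class, HiveShaded.Kryo.class));
    }
}
```

The two `Kryo` classes share a simple name, but the JVM matches methods on the 
fully qualified parameter type, so the Hive-side method does not count as an 
implementation of the one Spark asks for.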

> Hive on Spark generate more shuffle data than hive on mr
> --------------------------------------------------------
>
>                 Key: HIVE-15104
>                 URL: https://issues.apache.org/jira/browse/HIVE-15104
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.2.1
>            Reporter: wangwenli
>            Assignee: Rui Li
>         Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, 
> HIVE-15104.3.patch, HIVE-15104.4.patch, HIVE-15104.5.patch, 
> HIVE-15104.5.patch, TPC-H 100G.xlsx
>
>
> The same SQL, run on the Spark and MR engines, generates different sizes of 
> shuffle data.
> I think this is because Hive on MR serializes only part of the HiveKey, 
> while Hive on Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?
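
The reporter's hypothesis can be illustrated with a small sketch. It assumes a 
key type that keeps its bytes in a backing array larger than the valid length. 
`GrowableKey` and both size methods are hypothetical and are not Hive's 
HiveKey or Kryo's actual wire format. Whether Kryo's default handling of 
HiveKey matches the second method is an assumption taken from the description 
above.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; GrowableKey is not Hive's HiveKey.
class ShuffleSizeDemo {
    // A key whose backing array may be larger than its valid length.
    static class GrowableKey {
        final byte[] buffer;
        final int length;

        GrowableKey(int capacity, int length) {
            this.buffer = new byte[capacity];
            this.length = length;
        }
    }

    // Writes the length and only the valid prefix of the buffer.
    static int sizeValidBytesOnly(GrowableKey k) {
        ByteBuffer out = ByteBuffer.allocate(4 + k.length);
        out.putInt(k.length);
        out.put(k.buffer, 0, k.length);
        return out.position();
    }

    // Writes every field, including the unused tail of the buffer.
    static int sizeWholeObject(GrowableKey k) {
        ByteBuffer out = ByteBuffer.allocate(8 + k.buffer.length);
        out.putInt(k.length);
        out.putInt(k.buffer.length);
        out.put(k.buffer, 0, k.buffer.length);
        return out.position();
    }

    public static void main(String[] args) {
        GrowableKey key = new GrowableKey(64, 10);
        System.out.println(sizeValidBytesOnly(key)); // 14
        System.out.println(sizeWholeObject(key));    // 72
    }
}
```

With a 64-byte backing array holding 10 valid bytes, the first method emits 14 
bytes and the second 72, which is the kind of gap the report describes.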



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
