[jira] [Commented] (HIVE-15104) Hive on Spark generate more shuffle data than hive on mr

Rui Li (JIRA) Fri, 16 Jun 2017 08:00:17 -0700

    [ https://issues.apache.org/jira/browse/HIVE-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16052006#comment-16052006 ]


Rui Li commented on HIVE-15104:
-------------------------------

The approach here can cause problems when we cache RDDs, e.g. when combining 
equivalent works. Cached RDDs are serialized when they are stored to disk or 
transferred over the network, and we still need the hash code after the data 
is deserialized. I think we have to ser/de the hash code anyway to be safe.
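
To illustrate the concern, here is a minimal sketch in plain Java. It is not 
Hive's HiveKey and does not use Kryo or Spark: LeanKey, FullKey, roundTrip and 
the field names are hypothetical, and java.io serialization with a transient 
field stands in for a serializer that skips the hash code. The assumption is 
that the hash code is set by the caller rather than derived from the key 
bytes, so it cannot be recomputed after deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class HashCodeSerDeSketch {

    /** Hypothetical key that serializes only its bytes (hash code is skipped). */
    static final class LeanKey implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] bytes;
        transient int hash;          // not written, so 0 after a round trip
        transient boolean hashValid; // not written, so false after a round trip

        LeanKey(byte[] bytes, int hash) {
            this.bytes = bytes;
            this.hash = hash;
            this.hashValid = true;
        }
    }

    /** Hypothetical key that also serializes the externally supplied hash code. */
    static final class FullKey implements Serializable {
        private static final long serialVersionUID = 1L;
        final byte[] bytes;
        int hash;
        boolean hashValid;

        FullKey(byte[] bytes, int hash) {
            this.bytes = bytes;
            this.hash = hash;
            this.hashValid = true;
        }
    }

    /** Serializes and deserializes a value in memory, as a stand-in for a cache spill. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        LeanKey lean = roundTrip(new LeanKey(new byte[] {1, 2, 3}, 42));
        FullKey full = roundTrip(new FullKey(new byte[] {1, 2, 3}, 42));
        System.out.println("lean key hash still valid: " + lean.hashValid);
        System.out.println("full key hash still valid: " + full.hashValid);
    }
}
```

In this sketch the lean key comes back with its bytes intact but without a 
usable hash code, while the full key keeps it at the cost of a few extra 
serialized bytes per key.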

> Hive on Spark generate more shuffle data than hive on mr
> --------------------------------------------------------
>
>                 Key: HIVE-15104
>                 URL: https://issues.apache.org/jira/browse/HIVE-15104
>             Project: Hive
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 1.2.1
>            Reporter: wangwenli
>            Assignee: Rui Li
>         Attachments: HIVE-15104.1.patch, HIVE-15104.2.patch, 
> HIVE-15104.3.patch, TPC-H 100G.xlsx
>
>
> The same SQL, run on the Spark and MR engines, generates different sizes 
> of shuffle data.
> I think this is because Hive on MR serializes only part of HiveKey, while 
> Hive on Spark, which uses Kryo, serializes the full HiveKey object.
> What is your opinion?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
