[
https://issues.apache.org/jira/browse/HIVE-20032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540257#comment-16540257
]
Sahil Takiar commented on HIVE-20032:
-------------------------------------
[~lirui] could you take a look?
This patch also turns {{hive.spark.optimize.shuffle.serde}} on by default. I
think we should try to get to a point where we never have to serialize the
hashCode. It's confusing to users migrating from Hive-on-MR to HoS when they
see a query that requires more shuffle data in HoS than in Hive-on-MR.
This is the first step towards achieving that. Doing it completely will be
tricky. Off the top of my head, we will need a way to specify separate
serializers for caching RDDs vs. shuffling them. We will also need a way to
preserve the hashCode for {{groupByKey}}.
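
To make the size argument concrete, here is a rough, self-contained sketch. It
is not the code in the patch: {{ShuffleKeySketch}}, {{Key}}, {{serialize}} and
{{needsHashCode}} are hypothetical names, and the byte layout is an assumption
chosen only to show that dropping the hashCode saves one int per shuffled key.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ShuffleKeySketch {

    // Hypothetical stand-in for a shuffle key: raw key bytes plus a
    // precomputed hash. This is NOT Hive's actual key class.
    public static final class Key {
        final byte[] bytes;
        final int hash;

        public Key(byte[] bytes, int hash) {
            this.bytes = bytes;
            this.hash = hash;
        }
    }

    // Assumed layout: 4-byte length prefix, key bytes, optional 4-byte hash.
    public static byte[] serialize(Key key, boolean includeHashCode) {
        int size = Integer.BYTES + key.bytes.length
                + (includeHashCode ? Integer.BYTES : 0);
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(key.bytes.length);
        buf.put(key.bytes);
        if (includeHashCode) {
            buf.putInt(key.hash);
        }
        return buf.array();
    }

    // Per the issue title: the hash can be dropped only when both
    // RDD caching and groupBy shuffles are disabled.
    public static boolean needsHashCode(boolean rddCachingEnabled,
                                        boolean groupByShuffle) {
        return rddCachingEnabled || groupByShuffle;
    }

    public static void main(String[] args) {
        Key key = new Key("abc".getBytes(StandardCharsets.UTF_8), 12345);
        boolean keepHash = needsHashCode(false, false);
        System.out.println("with hashCode:    " + serialize(key, true).length + " bytes");
        System.out.println("without hashCode: " + serialize(key, false).length + " bytes");
        System.out.println("hash kept when caching and groupBy are off: " + keepHash);
    }
}
```

The saving is small per record but is paid on every key that goes through the
shuffle, which is why it shows up when comparing shuffle sizes.
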
> Don't serialize hashCode when groupByShuffle and RDD cacheing is disabled
> -------------------------------------------------------------------------
>
> Key: HIVE-20032
> URL: https://issues.apache.org/jira/browse/HIVE-20032
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: HIVE-20032.1.patch, HIVE-20032.2.patch,
> HIVE-20032.3.patch
>
>
> Follow-up to HIVE-15104: if neither RDD caching nor groupBy shuffles are
> enabled, we don't need to serialize the hashCode when shuffling data in HoS.