[
https://issues.apache.org/jira/browse/HIVE-20032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542413#comment-16542413
]
Sahil Takiar commented on HIVE-20032:
-------------------------------------
As for benchmarking, I have run a lot of TPC-DS tests, and I don't
consistently get better performance. However, the amount of shuffled data is
significantly reduced, as is the amount of data spilled to disk. My guess is
that latency doesn't improve much because I'm running my tests on an unloaded
cluster. I still expect cluster throughput to be better with this patch, since
fewer I/O resources are being used. I'll need to run some concurrent TPC-DS
workloads to confirm this, though.
> Don't serialize hashCode when groupByShuffle and RDD cacheing is disabled
> -------------------------------------------------------------------------
>
> Key: HIVE-20032
> URL: https://issues.apache.org/jira/browse/HIVE-20032
> Project: Hive
> Issue Type: Improvement
> Components: Spark
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Attachments: HIVE-20032.1.patch, HIVE-20032.2.patch,
> HIVE-20032.3.patch, HIVE-20032.4.patch
>
>
> Follow-up to HIVE-15104: if neither RDD caching nor groupByShuffle is
> enabled, we don't need to serialize the hashCode when shuffling data in
> HoS (Hive on Spark).
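
To illustrate the idea in the description, here is a minimal sketch of a
shuffle key serializer that writes the precomputed hash code only when asked
to. It is not the code from the attached patches: the class ShuffleKeySketch,
its methods, and the includeHashCode flag are hypothetical names chosen for
this example, and the wire layout (length, key bytes, optional 4-byte hash) is
an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch only; not Hive's actual key class or serializer.
final class ShuffleKeySketch {
    final byte[] bytes;        // serialized key columns
    final int hashCode;        // precomputed hash (meaningful only if hasHashCode)
    final boolean hasHashCode; // false when the hash was dropped during the shuffle

    ShuffleKeySketch(byte[] bytes, int hashCode, boolean hasHashCode) {
        this.bytes = bytes;
        this.hashCode = hashCode;
        this.hasHashCode = hasHashCode;
    }

    // Assumed layout: [int length][key bytes][int hashCode, only if requested].
    static byte[] serialize(ShuffleKeySketch key, boolean includeHashCode) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(key.bytes.length);
            out.write(key.bytes);
            if (includeHashCode) {
                out.writeInt(key.hashCode); // 4 extra bytes per shuffled record
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // The reader must be told whether the writer included the hash code.
    static ShuffleKeySketch deserialize(byte[] data, boolean includeHashCode) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            if (includeHashCode) {
                return new ShuffleKeySketch(bytes, in.readInt(), true);
            }
            // Hash omitted: downstream code must not rely on it.
            return new ShuffleKeySketch(bytes, 0, false);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ShuffleKeySketch key = new ShuffleKeySketch(new byte[] {1, 2, 3}, 42, true);
        System.out.println(serialize(key, true).length);  // 11 bytes: 4 + 3 + 4
        System.out.println(serialize(key, false).length); // 7 bytes: 4 + 3
    }
}
```

In this sketch the saving is a fixed 4 bytes per record, so the relative
reduction is largest for small keys. Writer and reader have to agree on the
flag, which is why it would be derived from the job configuration rather than
decided per record.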