[
https://issues.apache.org/jira/browse/HIVE-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Indhumathi Muthumurugesh updated HIVE-29358:
--------------------------------------------
Attachment: seg1.txt
> Vectorized Reduce Sink (new) Causes Skew/Imbalance on Joins
> -----------------------------------------------------------
>
> Key: HIVE-29358
> URL: https://issues.apache.org/jira/browse/HIVE-29358
> Project: Hive
> Issue Type: Bug
> Reporter: Indhumathi Muthumurugesh
> Assignee: Indhumathi Muthumurugesh
> Priority: Major
> Attachments: seg1.txt, seg2.txt
>
>
> A join query experiences severe *data skew* and performance degradation
> when the setting {{hive.vectorized.execution.reducesink.new.enabled}} is set
> to {{true}} (the default/optimized setting). When this property is disabled
> ({{false}}), the query runs correctly and efficiently across multiple
> reducers.
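>
> To compare the two behaviors in one session, the property can be toggled before running the query. This is a minimal sketch that uses only the property named above; it assumes session-level {{set}} is permitted in the environment.
>
> ```sql
> -- Workaround described above: disable the new vectorized reduce sink
> set hive.vectorized.execution.reducesink.new.enabled=false;
> -- ... run the join query here, then restore the default to reproduce the skew
> set hive.vectorized.execution.reducesink.new.enabled=true;
> ```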
>
> *Steps to Reproduce:*
>
> create table seg1 (a string, b string, c string, d string, e string,
> f string, g string, h int, i string, j string, k string, l string, m int);
> load data local inpath '/Users/indhu/Desktop/seg1.txt' into table seg1;
> create table seg2 (aa string, bb string, cc string, dd string, ee string,
> ff string, gg string, hh string, ii string, jj string);
> load data local inpath '/Users/indhu/Desktop/seg2.txt' into table seg2;
>
> Problematic query:
> create table seg3 as
> select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
> from seg1 a, seg2 b;
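>
> Note that the query has no join condition, so it is a cross join. As a diagnostic sketch (not part of the original reproduction), the plan can be inspected to see which reduce sink operator is chosen. It assumes {{explain vectorization detail}} is available in the Hive version in use.
>
> ```sql
> -- Show the plan with vectorization details for the select part of the query
> explain vectorization detail
> select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
> from seg1 a, seg2 b;
> ```
>
> Comparing the output with the property set to {{true}} and to {{false}} may help narrow down where the partitioning differs.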
>
> h3. Observed Behavior
> * *Reducer Logs:* The majority of reducers show {{exec.FileSinkOperator:
> FS[X]: records written - 0}}
> * *Bottleneck:* One or a few reducers receive nearly all of the data for the
> join key, causing the task to run for an excessive duration.
> * *Conclusion:* The new vectorized shuffle mechanism is failing to correctly
> partition and distribute the highly skewed keys across the available reducers.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)