Indhumathi Muthumurugesh created HIVE-29358:
-----------------------------------------------

             Summary: Vectorized Reduce Sink (new) Causes Skew/Imbalance on 
Joins
                 Key: HIVE-29358
                 URL: https://issues.apache.org/jira/browse/HIVE-29358
             Project: Hive
          Issue Type: Bug
            Reporter: Indhumathi Muthumurugesh


A join query suffers severe *data skew* and performance degradation when 
{{hive.vectorized.execution.reducesink.new.enabled}} is set to {{true}} 
(the default/optimized setting). When this property is set to {{false}}, 
the query runs correctly and efficiently across multiple reducers.
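
The workaround described above can be applied per session. A minimal sketch, assuming the tables from the reproduction steps already exist and that the property is changed in the same session that runs the query:

```sql
-- Turn off only the new vectorized reduce sink for this session;
-- other vectorization settings are left unchanged.
set hive.vectorized.execution.reducesink.new.enabled=false;

-- Re-run the problematic query from this report in the same session.
create table seg3 as
select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
from seg1 a, seg2 b;
```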

 

*Steps to Reproduce:*

 
create table seg1 (a string, b string, c string, d string, e string, f string,
  g string, h int, i string, j string, k string, l string, m int);
load data local inpath '/Users/indhu/Desktop/seg1.txt' into table seg1;

create table seg2 (aa string, bb string, cc string, dd string, ee string,
  ff string, gg string, hh string, ii string, jj string);
load data local inpath '/Users/indhu/Desktop/seg2.txt' into table seg2;

Problematic query:
create table seg3 as
  select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
  from seg1 a, seg2 b;
 
h3. Observed Behavior
 * *Reducer Logs:* The majority of reducers show {{exec.FileSinkOperator: 
FS[X]: records written - 0}}

 * *Bottleneck:* One or a few reducers receive nearly all of the data for the 
join key, causing the task to run for an excessive duration.

 * *Conclusion:* The new vectorized shuffle mechanism fails to partition and 
distribute the highly skewed keys correctly across the available reducers.
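
One way to narrow this down is to compare the query plans with the property on and off. The following is a sketch only, assuming a Hive version that supports {{EXPLAIN VECTORIZATION}}; it uses the SELECT from the problematic query and does not create {{seg3}}:

```sql
-- Plan with the new vectorized reduce sink enabled (the failing case).
set hive.vectorized.execution.reducesink.new.enabled=true;
explain vectorization detail
select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
from seg1 a, seg2 b;

-- Plan with the new vectorized reduce sink disabled (the working case).
set hive.vectorized.execution.reducesink.new.enabled=false;
explain vectorization detail
select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
from seg1 a, seg2 b;
```

Any difference between the reduce sink sections of the two plans would be a reasonable starting point for investigating how rows are assigned to reducers.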



