[ https://issues.apache.org/jira/browse/HIVE-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Indhumathi Muthumurugesh updated HIVE-29358:
--------------------------------------------
    Description: 
A join query suffers severe *data skew* and performance degradation when 
{{hive.vectorized.execution.reducesink.new.enabled}} is set to {{true}} (the 
default, optimized setting). When this property is set to {{false}}, the query 
runs correctly and its work is spread efficiently across multiple reducers.

 

*Steps to Reproduce:*

 
create table seg1 (a string, b string, c string, d string, e string, f string,
  g string, h int, i string, j string, k string, l string, m int);
load data local inpath '/Users/indhu/Desktop/seg1.txt' into table seg1;

create table seg2 (aa string, bb string, cc string, dd string, ee string,
  ff string, gg string, hh string, ii string, jj string);
load data local inpath '/Users/indhu/Desktop/seg2.txt' into table seg2;
 
Problematic query:
create table seg3 as
select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
from seg1 a, seg2 b;
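
To compare the two modes in one session, a minimal sketch follows. It assumes the property can be changed at session level with {{set}} (it may be restricted on some deployments). The target table names {{seg3_rs_new}} and {{seg3_rs_old}} are hypothetical, chosen only so that both CTAS runs can coexist.

{code:sql}
-- Run 1: new vectorized reduce sink enabled (the default, as reported).
set hive.vectorized.execution.reducesink.new.enabled=true;
create table seg3_rs_new as
select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
from seg1 a, seg2 b;

-- Run 2: the same query with the new reduce sink disabled.
set hive.vectorized.execution.reducesink.new.enabled=false;
create table seg3_rs_old as
select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
from seg1 a, seg2 b;

-- Sanity check: both runs should produce the same number of rows;
-- only the distribution of work across reducers is expected to differ.
select count(*) from seg3_rs_new;
select count(*) from seg3_rs_old;
{code}

Comparing the per-reducer record counts of the two runs should make any difference in distribution visible.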

!Reduce_sink_enabled.png|width=616,height=358!

 

!Reduce_sink_disabled.png|width=624,height=362!
 
h3. Observed Behavior
 * *Reducer Logs:* Most reducers log {{exec.FileSinkOperator: FS[X]: records written - 0}}, 
meaning they write no rows.

 * *Bottleneck:* One or a few reducers receive nearly all of the data for the 
join key, so those tasks run far longer than the rest.

 * *Conclusion:* The new vectorized shuffle mechanism fails to partition and 
distribute the highly skewed keys correctly across the available reducers.
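
The distribution can also be checked without reading reducer logs. The sketch below assumes that each reducer writes its own output file for the CTAS target, which may not hold if small-file merging is enabled. It counts rows per output file using the {{INPUT__FILE__NAME}} virtual column, then asks for the vectorization details of the plan, which should show which reduce sink implementation is selected.

{code:sql}
-- Rows per output file of the CTAS result. In a heavily skewed run,
-- one or a few files are expected to hold nearly all rows.
select INPUT__FILE__NAME as output_file, count(*) as row_count
from seg3
group by INPUT__FILE__NAME
order by row_count desc;

-- Plan inspection: the reduce sink section of the output should show
-- whether the new vectorized reduce sink is used for this query.
explain vectorization detail
select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
from seg1 a, seg2 b;
{code}

Running both statements once with the property set to {{true}} and once with {{false}} gives a side-by-side view of the plan and of the resulting file sizes.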

  was:
A join query suffers severe *data skew* and performance degradation when 
{{hive.vectorized.execution.reducesink.new.enabled}} is set to {{true}} (the 
default, optimized setting). When this property is set to {{false}}, the query 
runs correctly and its work is spread efficiently across multiple reducers.

 

*Steps to Reproduce:*

 
create table seg1 (a string, b string, c string, d string, e string, f string,
  g string, h int, i string, j string, k string, l string, m int);
load data local inpath '/Users/indhu/Desktop/seg1.txt' into table seg1;

create table seg2 (aa string, bb string, cc string, dd string, ee string,
  ff string, gg string, hh string, ii string, jj string);
load data local inpath '/Users/indhu/Desktop/seg2.txt' into table seg2;
 
Problematic query:
create table seg3 as
select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
from seg1 a, seg2 b;
 
h3. Observed Behavior
 * *Reducer Logs:* Most reducers log {{exec.FileSinkOperator: FS[X]: records written - 0}}, 
meaning they write no rows.

 * *Bottleneck:* One or a few reducers receive nearly all of the data for the 
join key, so those tasks run far longer than the rest.

 * *Conclusion:* The new vectorized shuffle mechanism fails to partition and 
distribute the highly skewed keys correctly across the available reducers.


> Vectorized Reduce Sink (new) Causes Skew/Imbalance on Joins
> -----------------------------------------------------------
>
>                 Key: HIVE-29358
>                 URL: https://issues.apache.org/jira/browse/HIVE-29358
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Indhumathi Muthumurugesh
>            Assignee: Indhumathi Muthumurugesh
>            Priority: Major
>         Attachments: Reduce_sink_disabled.png, Reduce_sink_enabled.png, 
> seg1.txt, seg2.txt
>
>
> A join query suffers severe *data skew* and performance degradation when 
> {{hive.vectorized.execution.reducesink.new.enabled}} is set to {{true}} (the 
> default, optimized setting). When this property is set to {{false}}, the 
> query runs correctly and its work is spread efficiently across multiple 
> reducers.
>  
> *Steps to Reproduce:*
>  
> create table seg1 (a string, b string, c string, d string, e string, f string,
>   g string, h int, i string, j string, k string, l string, m int);
> load data local inpath '/Users/indhu/Desktop/seg1.txt' into table seg1;
> create table seg2 (aa string, bb string, cc string, dd string, ee string,
>   ff string, gg string, hh string, ii string, jj string);
> load data local inpath '/Users/indhu/Desktop/seg2.txt' into table seg2;
>  
> Problematic query:
> create table seg3 as
> select a.*, b.bb, b.cc, b.dd, b.ff, b.ii, b.jj
> from seg1 a, seg2 b;
> !Reduce_sink_enabled.png|width=616,height=358!
>  
> !Reduce_sink_disabled.png|width=624,height=362!
>  
> h3. Observed Behavior
>  * *Reducer Logs:* Most reducers log {{exec.FileSinkOperator: FS[X]: records written - 0}}, 
> meaning they write no rows.
>  * *Bottleneck:* One or a few reducers receive nearly all of the data for the 
> join key, so those tasks run far longer than the rest.
>  * *Conclusion:* The new vectorized shuffle mechanism fails to partition and 
> distribute the highly skewed keys correctly across the available reducers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
