[GitHub] spark issue #15297: [SPARK-9862]Handling data skew

YuhuWang2002 Mon, 24 Oct 2016 23:11:20 -0700

Github user YuhuWang2002 commented on the issue:

    https://github.com/apache/spark/pull/15297
  
    I ran a performance test comparing the skew join algorithm with the normal 
join algorithm.
    I generated two tables, with 1/5 data skew in table S and 1/10000 data skew 
in table R. Both tables are skewed on the same key.
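
A minimal sketch of how such skewed join keys could be generated, for illustration only. It assumes "1/5 data skew" means that one in every five rows carries the same hot key; the comment does not spell this out. The function `skewed_keys`, its parameters, the hot key value 0, and the scaled-down row counts are all hypothetical and are not taken from the test setup.

```python
import random
from collections import Counter


def skewed_keys(n_rows, one_in, hot_key=0, n_distinct=1000, seed=0):
    """Return n_rows join keys where one in every `one_in` rows is hot_key.

    Hypothetical helper: the real test data generator is not shown in the
    comment, so this only illustrates the assumed shape of the data.
    """
    rng = random.Random(seed)
    n_hot = n_rows // one_in
    keys = [hot_key] * n_hot
    # The remaining rows draw keys uniformly from 1..n_distinct,
    # so they never collide with hot_key (0).
    keys += [rng.randint(1, n_distinct) for _ in range(n_rows - n_hot)]
    rng.shuffle(keys)
    return keys


# Row counts scaled down by 1000x from the test (S: 10000000, R: 100000000).
s_keys = skewed_keys(10_000, one_in=5)        # 1/5 of S rows share the hot key
r_keys = skewed_keys(100_000, one_in=10_000)  # 1/10000 of R rows share it

hot_s = Counter(s_keys)[0]
hot_r = Counter(r_keys)[0]
# Join output rows produced by the hot key alone: hot_s * hot_r.
hot_join_rows = hot_s * hot_r
```

Under this assumption the hot key alone contributes `hot_s * hot_r` rows to the join, all landing in one reduce partition unless the skew is handled.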
    
    spark.sql.adaptive.skewjoin.threshold   6000000
    spark.sql.adaptive.shuffle.targetPostShuffleInputSize   5000000
    records: S 10000000 rows; R 100000000 rows
    sql:
    select count(*) from R,S where rid=sid and sname>'wang9' and rname > 'zhang9';
    
    skew algorithm : 167.695s
    normal algorithm: 303.922s
    
    R2_txt has 100000000 rows and no data skew.
    sql:
    select count(*) from R2_txt,S where rid=sid and sname>'wang' and rname > 'zhang9';
    skew algorithm : 38.717s
    normal algorithm: 114.21s
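
For reference, the speedups implied by the timings above can be worked out as follows. This is plain arithmetic on the reported numbers; the dictionary labels are descriptive names chosen here, not identifiers from the test.

```python
# (skew algorithm seconds, normal algorithm seconds), as reported above.
timings = {
    "R join S": (167.695, 303.922),
    "R2_txt join S": (38.717, 114.21),
}

# Speedup = normal runtime divided by skew-join runtime.
speedups = {name: normal / skew for name, (skew, normal) in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
# R join S: 1.81x
# R2_txt join S: 2.95x
```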



