GitHub user YuhuWang2002 commented on the issue:
https://github.com/apache/spark/pull/15297
I ran some performance tests comparing runs with and without the skew join
algorithm.
I generated 2 tables, with 1/5 data skew in table S and 1/10000 data skew in
table R. Both tables are skewed on the same key.
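For illustration, one reading of "1/5 data skew" is that about one fifth of a table's rows share a single join key while every other key is unique. The comment does not say how the data was actually generated, so the sketch below is only an assumption: the helper `skewed_keys`, the hot key value 0, and the scaled-down row counts are all hypothetical.

```python
import random

def skewed_keys(n_rows, skew_fraction, hot_key=0, seed=42):
    """Return n_rows join keys where skew_fraction of them equal hot_key.

    Hypothetical helper: the original data generator is not described.
    """
    rng = random.Random(seed)
    n_hot = round(n_rows * skew_fraction)
    # All remaining keys are unique, so hot_key is the only skewed key.
    keys = [hot_key] * n_hot + list(range(1, n_rows - n_hot + 1))
    rng.shuffle(keys)
    return keys

# Scaled-down stand-ins for S (1/5 skew) and R (1/10000 skew),
# both skewed on the same key (0).
s_keys = skewed_keys(1000, 1 / 5)
r_keys = skewed_keys(10000, 1 / 10000)
print(s_keys.count(0), r_keys.count(0))
```

With both tables skewed on the same key, the hot key alone accounts for count(S) x count(R) joined rows in a single reduce partition, which is the case the skew join is meant to relieve.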
spark.sql.adaptive.skewjoin.threshold 6000000
spark.sql.adaptive.shuffle.targetPostShuffleInputSize 5000000
row counts: S 10000000 rows; R 100000000 rows
sql:
select count(*) from R,S
where rid=sid and sname>'wang9' and rname>'zhang9';
skew algorithm : 167.695s
normal algorithm: 303.922s
R2_txt has 100000000 rows with no data skew.
sql:
select count(*) from R2_txt,S
where rid=sid and sname>'wang' and rname>'zhang9';
skew algorithm : 38.717s
normal algorithm: 114.21s
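To put the timings above in relative terms, the speedup can be computed directly from the reported numbers. This is only arithmetic on the four figures in this comment; the dictionary labels are my own shorthand.

```python
# (skew algorithm seconds, normal algorithm seconds), as reported above.
timings = {
    "R join S": (167.695, 303.922),
    "R2_txt join S": (38.717, 114.21),
}

speedups = {}
for name, (skew_s, normal_s) in timings.items():
    speedups[name] = normal_s / skew_s
    saved = normal_s - skew_s
    print(f"{name}: {speedups[name]:.2f}x faster, {saved:.3f}s saved")
```

This works out to about 1.81x for the query on R and S and about 2.95x for the query on R2_txt and S.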