habren opened a new pull request #24726: [SPARK-27865][SQL] Support 1:N sort 
merge bucket join without shuffle
URL: https://github.com/apache/spark/pull/24726
 
 
   ## Support 1:N sort merge bucket join without shuffle
   
   
   ## Test
   Here is the code for verification
   ```scala
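   import org.apache.spark.sql.SparkSession

   // A 1-byte broadcast threshold effectively keeps the planner from choosing a broadcast join
   val spark = SparkSession.builder()
       .master("local[*]")
       .appName("TestBucketJoin")
       .config("spark.sql.autoBroadcastJoinThreshold", 1)
       .getOrCreate()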
   
   spark.sql(
       """create table tbl1(a int, b int)
         |using csv 
         |clustered by (a) 
         |sorted by (a) 
         |into 4 buckets
         |""".stripMargin)
   spark.sql(
       """create table tbl2(a int, b int)
         |using csv
         |clustered by (a)
         |sorted by (a)
         |into 4 buckets
         |""".stripMargin)
   spark.sql(
       """create table tbl3(a int, b int)
         |using csv
         |clustered by (a)
         |sorted by (a)
         |into 12 buckets
         |""".stripMargin)
   
   import spark.implicits._
   val data = spark.sparkContext.parallelize(0 until 12, 1)
   spark.createDataset(data).createOrReplaceTempView("data")
   
   spark.sql("insert overwrite table tbl1 select value, value from data")
   spark.sql("insert overwrite table tbl2 select value, value from data")
   spark.sql("insert overwrite table tbl3 select value, value from data")
   
   spark.sql("select * from tbl1 join tbl3 on tbl1.a = tbl3.a").show()
   ```
   
   For the join in the last line, this feature makes sure that a sort merge 
bucket join is used to join the two tables, which have 4 and 12 buckets 
respectively, without shuffling either side.
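
To illustrate why a 4-bucket table can be matched against a 12-bucket table without a shuffle, here is a minimal sketch in plain Scala. It assumes a bucket id is the non-negative modulo of the join key's hash and that both tables use the same hash function. The names `BucketMapping`, `pmod` and `oneToN` are hypothetical and are not taken from this PR, whose actual implementation may differ.

```scala
// Sketch only: models bucket ids with plain integer arithmetic, not Spark internals.
object BucketMapping {
  // Non-negative modulo, the assumed way a bucket id is derived from a hash value.
  def pmod(hash: Int, numBuckets: Int): Int = {
    val r = hash % numBuckets
    if (r < 0) r + numBuckets else r
  }

  // For each bucket of the smaller table, list the buckets of the larger table
  // that can hold matching join keys. Requires `large` to be a multiple of `small`.
  def oneToN(small: Int, large: Int): Map[Int, Seq[Int]] = {
    require(small > 0 && large >= small && large % small == 0,
      "large must be a positive multiple of small")
    (0 until small).map(i => i -> (i until large by small).toSeq).toMap
  }

  def main(args: Array[String]): Unit = {
    // 4 and 12 buckets, as in tbl1 and tbl3 above.
    oneToN(4, 12).toSeq.sortBy(_._1).foreach { case (s, ls) =>
      println(s"tbl1 bucket $s <-> tbl3 buckets ${ls.mkString(", ")}")
    }
  }
}
```

Under these assumptions, bucket `i` of the 4-bucket table only needs to be merged with buckets `i`, `i + 4` and `i + 8` of the 12-bucket table, because any hash that lands in bucket `j` modulo 12 lands in bucket `j % 4` modulo 4.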
   
   
   ![1 N bucket join 
DAG](https://user-images.githubusercontent.com/3096874/58465000-5b9bd800-8169-11e9-9d0c-6031b7dc20d0.png)
   
   

