[ https://issues.apache.org/jira/browse/SPARK-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577602#comment-14577602 ]
Yin Huai commented on SPARK-8102: --------------------------------- For the first query, {code} -- snippet 1 SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt FROM t_category c, t_zipcode z, click_meter_site_grouped g WHERE c.refCategoryID = g.category AND z.regionCode = g.region {code} I guess the analyzer got confused because first two tables are c and z but there is no join condition between them. So, a {{CartesianProduct}} was used. Although your tables can be small, but the result of a {{CartesianProduct}} can be huge. I guess we can do something in the analyzer to get rid of the {{CartesianProduct}}. > Big performance difference when joining 3 tables in different order > ------------------------------------------------------------------- > > Key: SPARK-8102 > URL: https://issues.apache.org/jira/browse/SPARK-8102 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 1.3.1 > Environment: spark in local mode > Reporter: Hao Ren > Attachments: query2job.png, query3job.png > > > Given 3 tables loaded from CSV files: > ( tables name => size) > *click_meter_site_grouped* =>10 687 455 bytes > *t_zipcode* => 2 738 954 bytes > *t_category* => 2 182 bytes > When joining the 3 tables, I notice a large performance difference if they > are joined in different order. > Here are the SQL queries to compare: > {code} > -- snippet 1 > SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt > FROM t_category c, t_zipcode z, click_meter_site_grouped g > WHERE c.refCategoryID = g.category AND z.regionCode = g.region > {code} > {code} > -- snippet 2 > SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt > FROM t_category c, click_meter_site_grouped g, t_zipcode z > WHERE c.refCategoryID = g.category AND z.regionCode = g.region > {code} > As you see, the largest table *click_meter_site_grouped* is the last table in > FROM clause in the first snippet, and it is in the middle of table list in > second one. > Snippet 2 runs three times faster than Snippet 1. > (8 seconds VS 24 seconds) > As the data is just sampled from a large data set, if we test it on the > original data set, it will normally result in a performance issue. > After checking the log, we found something strange In snippet 1's log: > 15/06/04 15:32:03 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:04 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:04 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:05 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:05 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:05 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:05 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:06 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:06 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:06 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:07 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:07 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:07 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:07 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:08 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:08 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:08 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:09 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:09 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:09 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:09 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:10 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:10 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:10 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:11 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:11 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:11 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:11 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:12 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:12 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:12 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:13 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:13 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:13 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:13 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:14 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:14 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:14 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:15 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:15 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:15 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:16 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:16 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:16 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:16 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:17 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:17 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:17 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:18 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:18 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:18 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:18 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:19 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:19 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:19 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:20 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > 15/06/04 15:32:20 INFO HadoopRDD: Input split: > file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954 > It seems that *t_zipcode* is loaded 56 times !!! And, for snippet 2, > everything is fine, all the three tables are loaded only once. > Knowing that SparkSQL's join can automatically broadcast table in join when > its size is below *autoBroadcastJoinThreshold*. Not sure if the over-load is > caused by auto broadcast. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org