Hao Ren created SPARK-8102:
------------------------------
Summary: Big performance difference when joining 3 tables in
different order
Key: SPARK-8102
URL: https://issues.apache.org/jira/browse/SPARK-8102
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.1
Environment: spark in local mode
Reporter: Hao Ren
Given 3 tables loaded from CSV files:
( tables name => size)
*click_meter_site_grouped* =>10 687 455 bytes
*t_zipcode* => 2 738 954 bytes
*t_category* => 2 182 bytes
When joining the 3 tables, I notice a large performance difference if they are
joined in different order.
Here are the SQL queries to compare:
{code}
-- snippet 1
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, t_zipcode z, click_meter_site_grouped g
WHERE c.refCategoryID = g.category AND z.regionCode = g.region
{code}
{code}
-- snippet 2
SELECT g.period, c.categoryName, z.regionName, action, list_id, cnt
FROM t_category c, click_meter_site_grouped g, t_zipcode z
WHERE c.refCategoryID = g.category AND z.regionCode = g.region
{code}
As you see, the largest table *click_meter_site_grouped* is the last table in
FROM clause in the first snippet, and it is in the middle of table list in
second one.
Snippet 2 runs three times faster than Snippet 1.
(8 seconds VS 24 seconds)
As the data is just sampled from a large data set, if we test it on the
original data set, it will normally result in a performance issue.
After checking the log, we found something strange In snippet 1's log:
15/06/04 15:32:03 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:04 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:04 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:05 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:05 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:05 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:05 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:06 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:06 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:06 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:07 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:07 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:07 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:07 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:08 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:08 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:08 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:09 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:09 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:09 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:09 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:10 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:10 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:10 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:11 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:11 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:11 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:11 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:12 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:12 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:12 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:13 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:13 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:13 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:13 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:14 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:14 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:14 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:15 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:15 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:15 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:16 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:16 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:16 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:16 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:17 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:17 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:17 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:18 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:18 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:18 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:18 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:19 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:19 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:19 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:20 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
15/06/04 15:32:20 INFO HadoopRDD: Input split:
file:/home/invkrh/workspace/java/data_spark_etl/data-sample/bconf/zipcodes_4.txt:0+2738954
It seems that *t_zipcode* is loaded 56 times !!! And, for snippet 2, everything
is fine, all the three tables are loaded only once.
Knowing that SparkSQL's join can automatically broadcast table in join when its
size is below *autoBroadcastJoinThreshold*. Not sure if the over-load is caused
by auto broadcast.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]