Re:Re: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-13 Thread Todd


Thanks Davies for the explanation.
When I turn off the following options, I still see that Spark 1.5 is much slower
than 1.4.1. I am trying to figure out how to configure Spark 1.5 so that it has
performance similar to Spark 1.4 for this particular query...

--conf spark.sql.planner.sortMergeJoin=false 
--conf spark.sql.tungsten.enabled=false
--conf spark.shuffle.reduceLocality.enabled=false
--conf spark.sql.planner.externalSort=false
--conf spark.sql.parquet.filterPushdown=false
--conf spark.sql.codegen=false
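
A side note on mechanics: the Spark SQL flags above can also be flipped from
inside a running session. A minimal sketch, assuming a spark-shell with a
SQLContext named sqlContext; spark.shuffle.reduceLocality.enabled is a Spark
core setting (as Hao points out later in this digest), so it still has to go
on the launch command line:

// illustrative only -- mirrors the --conf flags listed above
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false")
sqlContext.setConf("spark.sql.tungsten.enabled", "false")
sqlContext.setConf("spark.sql.planner.externalSort", "false")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
sqlContext.setConf("spark.sql.codegen", "false")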

At 2015-09-12 01:32:15, "Davies Liu" <dav...@databricks.com> wrote:
>I ran a similar benchmark for 1.5: a self join on a fact table with a
>join key that has many duplicated rows, say N rows per join key; after
>the join, there are N*N rows for each join key. Generating the joined
>row is slower in 1.5 than in 1.4 (1.5 needs to copy the left and right
>rows together; 1.4 does not). If the generated row is accessed after
>the join, there is not much difference between 1.5 and 1.4, because
>accessing the joined row is slower in 1.4 than in 1.5.
>
>So, for this particular query, 1.5 is slower than 1.4, and will be even
>slower as you increase N. But for real workloads it will not be; 1.5
>is usually faster than 1.4.
>
>On Fri, Sep 11, 2015 at 1:31 AM, prosp4300 <prosp4...@163.com> wrote:
>>
>>
>> By the way, turning off code generation could be an option to try; sometimes
>> code generation can introduce slowness.
>>
>>
>> On 2015-09-11 15:58, Cheng, Hao wrote:
>>
>> Can you confirm whether the query really runs in cluster mode, not local
>> mode? Can you print the call stack of the executor while the query is running?
>>
>>
>>
>> BTW: spark.shuffle.reduceLocality.enabled is a Spark core configuration, not
>> a Spark SQL one.
>>
>>
>>
>> From: Todd [mailto:bit1...@163.com]
>> Sent: Friday, September 11, 2015 3:39 PM
>> To: Todd
>> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
>> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
>> compared with spark 1.4.1 SQL
>>
>>
>>
>> I added the following two options:
>> spark.sql.planner.sortMergeJoin=false
>> spark.shuffle.reduceLocality.enabled=false
>>
>> But it still performs the same as without setting these two.
>>
>> One other thing: on the Spark UI, when I click the SQL tab, it shows an
>> empty page with only the header title 'SQL'; there is no table showing
>> queries and execution plan information.
>>
>>
>>
>>
>>
>> At 2015-09-11 14:39:06, "Todd" <bit1...@163.com> wrote:
>>
>>
>> Thanks Hao.
>> Yes, it is still as slow as with SMJ. Let me try the option you suggested.
>>
>>
>>
>>
>> At 2015-09-11 14:34:46, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>>
>> You mean the performance is still as slow as with SMJ in Spark 1.5?
>>
>>
>>
>> Can you set spark.shuffle.reduceLocality.enabled=false when you start the
>> spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
>> default, but we found it can cause performance to drop dramatically.
>>
>>
>>
>>
>>
>> From: Todd [mailto:bit1...@163.com]
>> Sent: Friday, September 11, 2015 2:17 PM
>> To: Cheng, Hao
>> Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
>> Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
>> spark 1.4.1 SQL
>>
>>
>>
>> Thanks Hao for the reply.
>> I turned sort merge join off; the physical plan is below, but the
>> performance is roughly the same as with it on...
>>
>> == Physical Plan ==
>> TungstenProject 
>> [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
>>  ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
>>   TungstenExchange hashpartitioning(ss_item_sk#2)
>>ConvertToUnsafe
>> Scan 
>> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
>>   TungstenExchange hashpartitioning(ss_item_sk#25)
>>ConvertToUnsafe
>> Scan 
>> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]
>>
>> Code Generation: true
>>
>>
>>
>>
>> At 2015-09-11 13:48:23, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>>
>> This is not a big surprise; the SMJ is slower than the HashJoin, as we do
>> not fully utilize the sorting yet. More details can be found at
>> https://issues.apache.org/jira/browse/SPARK-2926 .

Re: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Davies Liu
I ran a similar benchmark for 1.5: a self join on a fact table with a
join key that has many duplicated rows, say N rows per join key; after
the join, there are N*N rows for each join key. Generating the joined
row is slower in 1.5 than in 1.4 (1.5 needs to copy the left and right
rows together; 1.4 does not). If the generated row is accessed after
the join, there is not much difference between 1.5 and 1.4, because
accessing the joined row is slower in 1.4 than in 1.5.

So, for this particular query, 1.5 is slower than 1.4, and will be even
slower as you increase N. But for real workloads it will not be; 1.5
is usually faster than 1.4.
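
To make that blow-up concrete, a minimal sketch (not from the thread; it
assumes a spark-shell with a sqlContext available):

import org.apache.spark.sql.functions.col

// 1,000 rows over 10 distinct keys => N = 100 rows per key on each side
val fact = sqlContext.range(0, 1000).selectExpr("id % 10 as k")
// self join on the duplicated key: each key produces N*N = 10,000 rows
val joined = fact.as("t1").join(fact.as("t2"), col("t1.k") === col("t2.k"))
println(joined.count())  // 10 keys x 10,000 = 100,000 joined rows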

On Fri, Sep 11, 2015 at 1:31 AM, prosp4300 <prosp4...@163.com> wrote:
>
>
> By the way, turning off code generation could be an option to try; sometimes
> code generation can introduce slowness.
>
>
> On 2015-09-11 15:58, Cheng, Hao wrote:
>
> Can you confirm whether the query really runs in cluster mode, not local
> mode? Can you print the call stack of the executor while the query is running?
>
>
>
> BTW: spark.shuffle.reduceLocality.enabled is a Spark core configuration, not
> a Spark SQL one.
>
>
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 3:39 PM
> To: Todd
> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> compared with spark 1.4.1 SQL
>
>
>
> I added the following two options:
> spark.sql.planner.sortMergeJoin=false
> spark.shuffle.reduceLocality.enabled=false
>
> But it still performs the same as without setting these two.
>
> One other thing: on the Spark UI, when I click the SQL tab, it shows an
> empty page with only the header title 'SQL'; there is no table showing
> queries and execution plan information.
>
>
>
>
>
> At 2015-09-11 14:39:06, "Todd" <bit1...@163.com> wrote:
>
>
> Thanks Hao.
> Yes, it is still as slow as with SMJ. Let me try the option you suggested.
>
>
>
>
> At 2015-09-11 14:34:46, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>
> You mean the performance is still as slow as with SMJ in Spark 1.5?
>
>
>
> Can you set spark.shuffle.reduceLocality.enabled=false when you start the
> spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
> default, but we found it can cause performance to drop dramatically.
>
>
>
>
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 2:17 PM
> To: Cheng, Hao
> Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
> spark 1.4.1 SQL
>
>
>
> Thanks Hao for the reply.
> I turned sort merge join off; the physical plan is below, but the
> performance is roughly the same as with it on...
>
> == Physical Plan ==
> TungstenProject 
> [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
>  ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
>   TungstenExchange hashpartitioning(ss_item_sk#2)
>ConvertToUnsafe
> Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
>   TungstenExchange hashpartitioning(ss_item_sk#25)
>ConvertToUnsafe
> Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]
>
> Code Generation: true
>
>
>
>
> At 2015-09-11 13:48:23, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>
> This is not a big surprise; the SMJ is slower than the HashJoin, as we do not
> fully utilize the sorting yet. More details can be found at
> https://issues.apache.org/jira/browse/SPARK-2926 .
>
>
>
> Anyway, can you disable the sort merge join by
> “spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query
> again? In our previous testing, sort merge join was about 20% slower. I am
> not sure if anything else is slowing down the performance.
>
>
>
> Hao
>
>
>
>
>
> From: Jesse F Chen [mailto:jfc...@us.ibm.com]
> Sent: Friday, September 11, 2015 1:18 PM
> To: Michael Armbrust
> Cc: Todd; user@spark.apache.org
> Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with 
> spark 1.4.1 SQL
>
>
>
> Could this be a build issue (i.e., sbt package)?
>
> If I run the same jar built for 1.4.1 on 1.5, I am seeing a large regression
> too in queries (all other things identical)...
>
> I am curious: to build 1.5 (when it isn't released yet), what do I need to do
> with the build.sbt file?

RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Cheng, Hao
Can you confirm whether the query really runs in cluster mode, not local mode?
Can you print the call stack of the executor while the query is running?

BTW: spark.shuffle.reduceLocality.enabled is a Spark core configuration, not a
Spark SQL one.
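
(On capturing that call stack: the thread does not spell out a method, but a
common approach is to log into a worker node, find the executor JVM with
jps -l, looking for CoarseGrainedExecutorBackend, and dump its stack with
jstack <pid>.)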

From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 3:39 PM
To: Todd
Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
compared with spark 1.4.1 SQL

I added the following two options:
spark.sql.planner.sortMergeJoin=false
spark.shuffle.reduceLocality.enabled=false

But it still performs the same as without setting these two.

One other thing: on the Spark UI, when I click the SQL tab, it shows an
empty page with only the header title 'SQL'; there is no table showing
queries and execution plan information.




At 2015-09-11 14:39:06, "Todd" <bit1...@163.com<mailto:bit1...@163.com>> wrote:


Thanks Hao.
Yes, it is still as slow as with SMJ. Let me try the option you suggested.


At 2015-09-11 14:34:46, "Cheng, Hao" 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:

You mean the performance is still as slow as with SMJ in Spark 1.5?

Can you set spark.shuffle.reduceLocality.enabled=false when you start the
spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
default, but we found it can cause performance to drop dramatically.
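
(Assuming the standard launch scripts, that means starting the shell as, e.g.,
bin/spark-shell --conf spark.shuffle.reduceLocality.enabled=false.)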


From: Todd [mailto:bit1...@163.com<mailto:bit1...@163.com>]
Sent: Friday, September 11, 2015 2:17 PM
To: Cheng, Hao
Cc: Jesse F Chen; Michael Armbrust; 
user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
spark 1.4.1 SQL

Thanks Hao for the reply.
I turned sort merge join off; the physical plan is below, but the
performance is roughly the same as with it on...

== Physical Plan ==
TungstenProject 
[ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
 ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
  TungstenExchange hashpartitioning(ss_item_sk#2)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
  TungstenExchange hashpartitioning(ss_item_sk#25)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]

Code Generation: true



At 2015-09-11 13:48:23, "Cheng, Hao" 
<hao.ch...@intel.com<mailto:hao.ch...@intel.com>> wrote:
This is not a big surprise; the SMJ is slower than the HashJoin, as we do not
fully utilize the sorting yet. More details can be found at
https://issues.apache.org/jira/browse/SPARK-2926 .

Anyway, can you disable the sort merge join by
“spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query again?
In our previous testing, sort merge join was about 20% slower. I am not sure if
anything else is slowing down the performance.

Hao
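
(A note on the trailing “;” in that setting: it suggests the spark-sql CLI,
where the switch would be entered as SET spark.sql.planner.sortMergeJoin=false;
from a spark-shell, a sketch of the equivalent is
sqlContext.sql("SET spark.sql.planner.sortMergeJoin=false").)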


From: Jesse F Chen [mailto:jfc...@us.ibm.com<mailto:jfc...@us.ibm.com>]
Sent: Friday, September 11, 2015 1:18 PM
To: Michael Armbrust
Cc: Todd; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL


Could this be a build issue (i.e., sbt package)?

If I run the same jar built for 1.4.1 on 1.5, I am seeing a large regression too
in queries (all other things identical)...

I am curious: to build 1.5 (when it isn't released yet), what do I need to do
with the build.sbt file?

Any special parameters I should be using to make sure I load the latest Hive
dependencies?
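
Nobody answers the build.sbt question directly in this thread. For what it is
worth, a minimal sketch of the kind of file meant here; the version numbers and
the provided scope are assumptions, not confirmed on the list:

name := "tpcds-sparksql"

scalaVersion := "2.10.4"

// spark-hive pulls in the Hive dependencies asked about above
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.5.0" % "provided"
)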


From: Michael Armbrust <mich...@databricks.com<mailto:mich...@databricks.com>>
To: Todd <bit1...@163.com<mailto:bit1...@163.com>>
Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Date: 09/10/2015 11:07 AM
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL





I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so 
this is surprising.  In my experiments Spark 1.5 is either the same or faster 
than 1.4, with only small exceptions. A few thoughts:

 - 600 partitions is probably way too many for 6G of data.
 - Providing the output of explain for both runs would be helpful whenever 
reporting performance changes.
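
On the second point, a minimal sketch of capturing the plan (assuming a
spark-shell where `sql` holds the query string, as in Todd's snippet later in
this digest):

val df = sqlContext.sql(sql)
df.explain()  // prints the physical plan, e.g. SortMergeJoin vs ShuffledHashJoin
// or capture it as a string for pasting into a reply:
val plan = df.queryExecution.executedPlan.toString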

On Thu, Sep 10, 2015 at 1:24 AM, Todd <bit1...@163.com> wrote:

Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Todd
I added the following two options:
spark.sql.planner.sortMergeJoin=false
spark.shuffle.reduceLocality.enabled=false

But it still performs the same as without setting these two.

One other thing: on the Spark UI, when I click the SQL tab, it shows an
empty page with only the header title 'SQL'; there is no table showing
queries and execution plan information.

At 2015-09-11 14:39:06, "Todd"  wrote:


Thanks Hao.
Yes, it is still as slow as with SMJ. Let me try the option you suggested.




At 2015-09-11 14:34:46, "Cheng, Hao"  wrote:


You mean the performance is still as slow as with SMJ in Spark 1.5?

 

Can you set spark.shuffle.reduceLocality.enabled=false when you start the
spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
default, but we found it can cause performance to drop dramatically.

 

 

From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 2:17 PM
To: Cheng, Hao
Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
spark 1.4.1 SQL

 

Thanks Hao for the reply.
I turned sort merge join off; the physical plan is below, but the performance
is roughly the same as with it on...

== Physical Plan ==
TungstenProject 
[ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
 ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
  TungstenExchange hashpartitioning(ss_item_sk#2)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
  TungstenExchange hashpartitioning(ss_item_sk#25)
   ConvertToUnsafe
Scan 
ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]

Code Generation: true







At 2015-09-11 13:48:23, "Cheng, Hao"  wrote:



This is not a big surprise; the SMJ is slower than the HashJoin, as we do not
fully utilize the sorting yet. More details can be found at
https://issues.apache.org/jira/browse/SPARK-2926 .

 

Anyway, can you disable the sort merge join by
“spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query again?
In our previous testing, sort merge join was about 20% slower. I am not sure if
anything else is slowing down the performance.

 

Hao

 

 

From: Jesse F Chen [mailto:jfc...@us.ibm.com]
Sent: Friday, September 11, 2015 1:18 PM
To: Michael Armbrust
Cc: Todd; user@spark.apache.org
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL

 

Could this be a build issue (i.e., sbt package)?

If I run the same jar built for 1.4.1 on 1.5, I am seeing a large regression too
in queries (all other things identical)...

I am curious: to build 1.5 (when it isn't released yet), what do I need to do
with the build.sbt file?

Any special parameters I should be using to make sure I load the latest Hive
dependencies?


From: Michael Armbrust 
To: Todd 
Cc: "user@spark.apache.org" 
Date: 09/10/2015 11:07 AM
Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 
1.4.1 SQL




I've been running TPC-DS SF=1500 daily on Spark 1.4.1 and Spark 1.5 on S3, so 
this is surprising.  In my experiments Spark 1.5 is either the same or faster 
than 1.4, with only small exceptions. A few thoughts:

 - 600 partitions is probably way too many for 6G of data.
 - Providing the output of explain for both runs would be helpful whenever 
reporting performance changes.

On Thu, Sep 10, 2015 at 1:24 AM, Todd  wrote:

Hi,

I am using data generated with spark-sql-perf
(https://github.com/databricks/spark-sql-perf) to test Spark SQL performance
(Spark on YARN, with 10 nodes) with the following code. (The table store_sales
is about 90 million records, 6G in size.)
 
val outputDir = "hdfs://tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales"
val name = "store_sales"

// register the Parquet data as a temporary table
sqlContext.sql(
  s"""
     |CREATE TEMPORARY TABLE ${name}
     |USING org.apache.spark.sql.parquet
     |OPTIONS (
     |  path '${outputDir}'
     |)
   """.stripMargin)

// self join of store_sales with itself on ss_item_sk
val sql = """
  |select
  |  t1.ss_quantity,
  |  t1.ss_list_price,
  |  t1.ss_coupon_amt,
  |  t1.ss_cdemo_sk,
  |  t1.ss_item_sk,
  |  t1.ss_promo_sk,
  |  t1.ss_sold_date_sk
  |from store_sales t1 join store_sales t2 on t1.ss_item_sk = t2.ss_item_sk
  |where
  |  t1.ss_sold_date_sk between 2450815 and 2451179
  """.stripMargin

val df = sqlContext.sql(sql)
df.rdd.foreach(row => ())  // force the query to run to completion

With 

Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Davies Liu
Thanks. I'm surprised to see so much difference (4x); there
could be something wrong in Spark (some contention between tasks).

On Fri, Sep 11, 2015 at 11:47 AM, Jesse F Chen <jfc...@us.ibm.com> wrote:
>
> @Davies... good question.
>
> > Just curious what the difference would be if you used 20 executors
> > and 20G memory for each executor...
>
> So I tried the following combinations:
>
> (GB x # executors)   (query response time in secs)
> 20 x 20              415
> 10 x 40              230
> 5 x 80               141
> 4 x 100              128
> 2 x 200              104
>
> CPU utilization is high, so spreading more JVMs onto more vCores helps in this
> case. For other workloads where memory utilization outweighs CPU, I can see
> larger JVM sizes being more beneficial. It's for sure case-by-case.
>
> It seems the codegen and scheduler overheads are negligible.
>
>
>
>
> From: Davies Liu <dav...@databricks.com>
> To: Jesse F Chen/San Francisco/IBM@IBMUS
> Cc: "Cheng, Hao" <hao.ch...@intel.com>, Todd <bit1...@163.com>, Michael 
> Armbrust <mich...@databricks.com>, "user@spark.apache.org" 
> <user@spark.apache.org>
> Date: 09/11/2015 10:41 AM
> Subject: Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> compared with spark 1.4.1 SQL
>
>
>
>
> On Fri, Sep 11, 2015 at 10:31 AM, Jesse F Chen <jfc...@us.ibm.com> wrote:
> >
> > Thanks Hao!
> >
> > I tried your suggestion of setting 
> > spark.shuffle.reduceLocality.enabled=false and my initial tests showed 
> > queries are on par between 1.5 and 1.4.1.
> >
> > Results:
> >
> > tpcds-query39b-141.out:query time: 129.106478631 sec
> > tpcds-query39b-150-reduceLocality-false.out:query time: 128.854284296 sec
> > tpcds-query39b-150.out:query time: 572.443151734 sec
> >
> > With the default spark.shuffle.reduceLocality.enabled=true, I am seeing an
> > across-the-board slowdown for the majority of the TPCDS queries.
> >
> > My test is on a bare-metal 20-node cluster. I ran my test as follows:
> >
> > /TestAutomation/spark-1.5/bin/spark-submit  --master yarn-client  
> > --packages com.databricks:spark-csv_2.10:1.1.0 --name TPCDSSparkSQLHC
> > --conf spark.shuffle.reduceLocality.enabled=false
> > --executor-memory 4096m --num-executors 100
> > --class org.apache.spark.examples.sql.hive.TPCDSSparkSQLHC
> > /TestAutomation/databricks/spark-sql-perf-master/target/scala-2.10/tpcdssparksql_2.10-0.9.jar
> > hdfs://rhel2.cisco.com:8020/user/bigsql/hadoopds100g
> > /TestAutomation/databricks/spark-sql-perf-master/src/main/queries/jesse/query39b.sql
> >
>
> Just curious what the difference would be if you used 20 executors
> and 20G memory for each executor. Sharing the same JVM across tasks
> could reduce the overhead of codegen and JIT, and may also reduce the
> overhead of `reduceLocality` (it can make the tasks easier to schedule).
>
> >
> >
> >
> > "Cheng, Hao" ---09/11/2015 01:00:28 AM---Can you confirm if the query 
> > really run in the cluster mode? Not the local mode. Can you print the c
> >
> > From: "Cheng, Hao" <hao.ch...@intel.com>
> > To: Todd <bit1...@163.com>
> > Cc: Jesse F Chen/San Francisco/IBM@IBMUS, Michael Armbrust 
> > <mich...@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>
> > Date: 09/11/2015 01:00 AM
> > Subject: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> > compared with spark 1.4.1 SQL
> >
> >
> >
> >
> > Can you confirm whether the query really runs in cluster mode, not local
> > mode? Can you print the call stack of the executor while the query is
> > running?
> >
> > BTW: spark.shuffle.reduceLocality.enabled is a Spark core configuration,
> > not a Spark SQL one.
> >
> > From: Todd [mailto:bit1...@163.com]
> > Sent: Friday, September 11, 2015 3:39 PM
> > To: Todd
> > Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
> > Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> > compared with spark 1.4.1 SQL
> >
> > I added the following two options:
> > spark.sql.planner.sortMergeJoin=false
> > spark.shuffle.reduceLocality.enabled=false
> >
> > But it still performs the same as without setting these two.
> >
> > One other thing: on the Spark UI, when I click the SQL tab, it shows an
> > empty page with only the header title 'SQL'; there is no table showing
> > queries and execution plan information.

Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Jesse F Chen

@Davies... good question.

> Just curious what the difference would be if you used 20 executors
> and 20G memory for each executor...

So I tried the following combinations:

(GB x # executors)   (query response time in secs)
20 x 20              415
10 x 40              230
5 x 80               141
4 x 100              128
2 x 200              104
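
(Worth noting: every row above works out to the same 400 GB of total executor
memory; only the JVM count and per-JVM heap size vary.)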

CPU utilization is high, so spreading more JVMs onto more vCores helps in
this case. For other workloads where memory utilization outweighs CPU, I can
see larger JVM sizes being more beneficial. It's for sure case-by-case.

It seems the codegen and scheduler overheads are negligible.

From:   Davies Liu <dav...@databricks.com>
To: Jesse F Chen/San Francisco/IBM@IBMUS
Cc: "Cheng, Hao" <hao.ch...@intel.com>, Todd <bit1...@163.com>,
Michael Armbrust <mich...@databricks.com>,
"user@spark.apache.org" <user@spark.apache.org>
Date:   09/11/2015 10:41 AM
Subject:Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by
50%+ compared with spark 1.4.1 SQL



On Fri, Sep 11, 2015 at 10:31 AM, Jesse F Chen <jfc...@us.ibm.com> wrote:
>
> Thanks Hao!
>
> I tried your suggestion of setting spark.shuffle.reduceLocality.enabled=false
> and my initial tests showed queries are on par between 1.5 and 1.4.1.
>
> Results:
>
> tpcds-query39b-141.out:query time: 129.106478631 sec
> tpcds-query39b-150-reduceLocality-false.out:query time: 128.854284296 sec
> tpcds-query39b-150.out:query time: 572.443151734 sec
>
> With the default spark.shuffle.reduceLocality.enabled=true, I am seeing an
> across-the-board slowdown for the majority of the TPCDS queries.
>
> My test is on a bare-metal 20-node cluster. I ran my test as follows:
>
> /TestAutomation/spark-1.5/bin/spark-submit  --master yarn-client
> --packages com.databricks:spark-csv_2.10:1.1.0 --name TPCDSSparkSQLHC
> --conf spark.shuffle.reduceLocality.enabled=false
> --executor-memory 4096m --num-executors 100
> --class org.apache.spark.examples.sql.hive.TPCDSSparkSQLHC
> /TestAutomation/databricks/spark-sql-perf-master/target/scala-2.10/tpcdssparksql_2.10-0.9.jar
> hdfs://rhel2.cisco.com:8020/user/bigsql/hadoopds100g
> /TestAutomation/databricks/spark-sql-perf-master/src/main/queries/jesse/query39b.sql
>

Just curious what the difference would be if you used 20 executors
and 20G memory for each executor. Sharing the same JVM across tasks
could reduce the overhead of codegen and JIT, and may also reduce the
overhead of `reduceLocality` (it can make the tasks easier to schedule).

>
>
>
> "Cheng, Hao" ---09/11/2015 01:00:28 AM---Can you confirm if the query
really run in the cluster mode? Not the local mode. Can you print the c
>
> From: "Cheng, Hao" <hao.ch...@intel.com>
> To: Todd <bit1...@163.com>
> Cc: Jesse F Chen/San Francisco/IBM@IBMUS, Michael Armbrust
<mich...@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>
> Date: 09/11/2015 01:00 AM
> Subject: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by
50%+ compared with spark 1.4.1 SQL
>
>
>
>
> Can you confirm whether the query really runs in cluster mode, not local
> mode? Can you print the call stack of the executor while the query is
> running?
>
> BTW: spark.shuffle.reduceLocality.enabled is a Spark core configuration,
> not a Spark SQL one.
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 3:39 PM
> To: Todd
> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+
compared with spark 1.4.1 SQL
>
> I added the following two options:
> spark.sql.planner.sortMergeJoin=false
> spark.shuffle.reduceLocality.enabled=false
>
> But it still performs the same as without setting these two.
>
> One other thing: on the Spark UI, when I click the SQL tab, it shows an
> empty page with only the header title 'SQL'; there is no table showing
> queries and execution plan information.
>
>
>
>
> At 2015-09-11 14:39:06, "Todd" <bit1...@163.com> wrote:
>
>
> Thanks Hao.
> Yes, it is still as slow as with SMJ. Let me try the option you suggested.
>
>
> At 2015-09-11 14:34:46, "Cheng, Hao" <hao.ch...@intel.com> wrote:

Re: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Davies Liu
On Fri, Sep 11, 2015 at 10:31 AM, Jesse F Chen <jfc...@us.ibm.com> wrote:
>
> Thanks Hao!
>
> I tried your suggestion of setting spark.shuffle.reduceLocality.enabled=false 
> and my initial tests showed queries are on par between 1.5 and 1.4.1.
>
> Results:
>
> tpcds-query39b-141.out:query time: 129.106478631 sec
> tpcds-query39b-150-reduceLocality-false.out:query time: 128.854284296 sec
> tpcds-query39b-150.out:query time: 572.443151734 sec
>
> With the default spark.shuffle.reduceLocality.enabled=true, I am seeing an
> across-the-board slowdown for the majority of the TPCDS queries.
>
> My test is on a bare-metal 20-node cluster. I ran my test as follows:
>
> /TestAutomation/spark-1.5/bin/spark-submit  --master yarn-client  --packages 
> com.databricks:spark-csv_2.10:1.1.0 --name TPCDSSparkSQLHC
> --conf spark.shuffle.reduceLocality.enabled=false
> --executor-memory 4096m --num-executors 100
> --class org.apache.spark.examples.sql.hive.TPCDSSparkSQLHC
> /TestAutomation/databricks/spark-sql-perf-master/target/scala-2.10/tpcdssparksql_2.10-0.9.jar
> hdfs://rhel2.cisco.com:8020/user/bigsql/hadoopds100g
> /TestAutomation/databricks/spark-sql-perf-master/src/main/queries/jesse/query39b.sql
>

Just curious what the difference would be if you used 20 executors
and 20G memory for each executor. Sharing the same JVM across tasks
could reduce the overhead of codegen and JIT, and may also reduce the
overhead of `reduceLocality` (it can make the tasks easier to schedule).
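
(In terms of the spark-submit command Jesse posted above, that would mean
swapping in --num-executors 20 --executor-memory 20g.)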

>
>
>
> "Cheng, Hao" ---09/11/2015 01:00:28 AM---Can you confirm if the query really 
> run in the cluster mode? Not the local mode. Can you print the c
>
> From: "Cheng, Hao" <hao.ch...@intel.com>
> To: Todd <bit1...@163.com>
> Cc: Jesse F Chen/San Francisco/IBM@IBMUS, Michael Armbrust 
> <mich...@databricks.com>, "user@spark.apache.org" <user@spark.apache.org>
> Date: 09/11/2015 01:00 AM
> Subject: RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> compared with spark 1.4.1 SQL
>
>
>
>
> Can you confirm whether the query really runs in cluster mode, not local
> mode? Can you print the call stack of the executor while the query is running?
>
> BTW: spark.shuffle.reduceLocality.enabled is a Spark core configuration, not
> a Spark SQL one.
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 3:39 PM
> To: Todd
> Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ 
> compared with spark 1.4.1 SQL
>
> I added the following two options:
> spark.sql.planner.sortMergeJoin=false
> spark.shuffle.reduceLocality.enabled=false
>
> But it still performs the same as without setting these two.
>
> One other thing: on the Spark UI, when I click the SQL tab, it shows an
> empty page with only the header title 'SQL'; there is no table showing
> queries and execution plan information.
>
>
>
>
> At 2015-09-11 14:39:06, "Todd" <bit1...@163.com> wrote:
>
>
> Thanks Hao.
> Yes, it is still as slow as with SMJ. Let me try the option you suggested.
>
>
> At 2015-09-11 14:34:46, "Cheng, Hao" <hao.ch...@intel.com> wrote:
>
> You mean the performance is still as slow as with SMJ in Spark 1.5?
>
> Can you set spark.shuffle.reduceLocality.enabled=false when you start the
> spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true by
> default, but we found it can cause performance to drop dramatically.
>
>
> From: Todd [mailto:bit1...@163.com]
> Sent: Friday, September 11, 2015 2:17 PM
> To: Cheng, Hao
> Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
> Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with 
> spark 1.4.1 SQL
>
> Thanks Hao for the reply.
> I turned sort merge join off; the physical plan is below, but the
> performance is roughly the same as with it on...
>
> == Physical Plan ==
> TungstenProject 
> [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
> ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
>  TungstenExchange hashpartitioning(ss_item_sk#2)
>   ConvertToUnsafe
>Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
>  TungstenExchange hashpartitioning(ss_item_sk#25)
>   ConvertToUnsafe
>Scan 
> ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]
>
> Code Generation: true

RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Jesse F Chen

Thanks Hao!

I tried your suggestion of setting spark.shuffle.reduceLocality.enabled=false
and my initial tests showed queries are on par between 1.5 and 1.4.1.

Results:

tpcds-query39b-141.out:query time: 129.106478631 sec
tpcds-query39b-150-reduceLocality-false.out:query time: 128.854284296 sec
tpcds-query39b-150.out:query time: 572.443151734 sec

With the default spark.shuffle.reduceLocality.enabled=true, I am seeing an
across-the-board slowdown for the majority of the TPCDS queries.

My test is on a bare-metal 20-node cluster. I ran my test as follows:

/TestAutomation/spark-1.5/bin/spark-submit  --master yarn-client
--packages com.databricks:spark-csv_2.10:1.1.0 --name TPCDSSparkSQLHC
--conf spark.shuffle.reduceLocality.enabled=false
--executor-memory 4096m --num-executors 100
--class org.apache.spark.examples.sql.hive.TPCDSSparkSQLHC
/TestAutomation/databricks/spark-sql-perf-master/target/scala-2.10/tpcdssparksql_2.10-0.9.jar
hdfs://rhel2.cisco.com:8020/user/bigsql/hadoopds100g
/TestAutomation/databricks/spark-sql-perf-master/src/main/queries/jesse/query39b.sql

From:   "Cheng, Hao" <hao.ch...@intel.com>
To: Todd <bit1...@163.com>
Cc: Jesse F Chen/San Francisco/IBM@IBMUS, Michael Armbrust
<mich...@databricks.com>, "user@spark.apache.org"
<user@spark.apache.org>
Date:   09/11/2015 01:00 AM
Subject:    RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by
50%+ compared with spark 1.4.1 SQL



Can you confirm whether the query really runs in cluster mode, not local
mode? Can you print the call stack of the executor while the query is
running?

BTW: spark.shuffle.reduceLocality.enabled is a Spark core configuration,
not a Spark SQL one.

From: Todd [mailto:bit1...@163.com]
Sent: Friday, September 11, 2015 3:39 PM
To: Todd
Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org
Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+
compared with spark 1.4.1 SQL

I added the following two options:
spark.sql.planner.sortMergeJoin=false
spark.shuffle.reduceLocality.enabled=false

But it still performs the same as without setting these two.

One other thing: on the Spark UI, when I click the SQL tab, it shows an
empty page with only the header title 'SQL'; there is no table showing
queries and execution plan information.




At 2015-09-11 14:39:06, "Todd" <bit1...@163.com> wrote:


Thanks Hao.
Yes, it is still as slow as with SMJ. Let me try the option you suggested.


 At 2015-09-11 14:34:46, "Cheng, Hao" <hao.ch...@intel.com> wrote:

You mean the performance is still as slow as with SMJ in Spark 1.5?

Can you set spark.shuffle.reduceLocality.enabled=false when you start
the spark-shell/spark-sql? It’s a new feature in Spark 1.5, and it’s true
by default, but we found it can cause performance to drop dramatically.


  From: Todd [mailto:bit1...@163.com]
  Sent: Friday, September 11, 2015 2:17 PM
  To: Cheng, Hao
  Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org
  Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared
  with spark 1.4.1 SQL

Thanks Hao for the reply.
I turned sort merge join off; the physical plan is below, but the
performance is roughly the same as with it on...

== Physical Plan ==
TungstenProject [ss_quantity#10,ss_list_price#12,ss_coupon_amt#19,ss_cdemo_sk#4,ss_item_sk#2,ss_promo_sk#8,ss_sold_date_sk#0]
 ShuffledHashJoin [ss_item_sk#2], [ss_item_sk#25], BuildRight
  TungstenExchange hashpartitioning(ss_item_sk#2)
   ConvertToUnsafe
    Scan ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_promo_sk#8,ss_quantity#10,ss_cdemo_sk#4,ss_list_price#12,ss_coupon_amt#19,ss_item_sk#2,ss_sold_date_sk#0]
  TungstenExchange hashpartitioning(ss_item_sk#25)
   ConvertToUnsafe
    Scan ParquetRelation[hdfs://ns1/tmp/spark_perf/scaleFactor=30/useDecimal=true/store_sales][ss_item_sk#25]

Code Generation: true



  At 2015-09-11 13:48:23, "Cheng, Hao" <hao.ch...@intel.com> wrote:
This is not a big surprise; the SMJ is slower than the HashJoin, as we do
not fully utilize the sorting yet. More details can be found at
https://issues.apache.org/jira/browse/SPARK-2926 .

Anyway, can you disable the sort merge join by
“spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and run the query
again? In our previous testing, sort merge join was about 20% slower. I am
not sure if anything else is slowing down the performance.

  Hao


  From: Jesse F Chen [mailto:jfc...@us.ibm.com]
  Sent: Friday, September 11, 2015 1:18 PM
  To: Michael Armbrust
  Cc: Todd; user@spark.apache.org
  Subject: Re: spark 1.5 SQL slows down dramatically by 50%+ compared with
  spark 1.4.1 SQL



Could this be a build issue (i.e., sbt package)?

If I run the same jar built for 1.4.1 on 1.5, I am seeing a large regression
too in queries (all other things identical)...