It looks like you have data in those 24 partitions, or more. How many unique names are in your data set? Enlarging the shuffle partitions only makes sense if you have many distinct partition groups in your data. What you describe looks like either your dataset only has data in those 24 partitions, or the data in those 24 partitions is skewed. If you are really joining a 56 MB data set with a 26 MB data set, I am surprised that 24 partitions would run very slowly under an 8 GB executor.

Yong

Date: Wed, 6 May 2015 14:04:11 +0800
From: luohui20...@sina.com
To: luohui20...@sina.com; hao.ch...@intel.com; daoyuan.w...@intel.com; ssab...@gmail.com; user@spark.apache.org
Subject: Re: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
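Yong's skew question can be checked before tuning anything else. A minimal sketch of such a check, assuming the join keys can be sampled into plain Python (the `name` column is from the thread; the key values and counts below are invented for illustration):

```python
from collections import Counter

# Hypothetical sample of join keys from the `name` column of table db.
# In practice you would collect a sample of keys from the real table.
keys = ["chr1"] * 50000 + ["chr2"] * 49000 + ["chr3"] * 10 + ["chr4"] * 5

counts = Counter(keys)
total = sum(counts.values())

# Show the heaviest keys: if a handful of names dominate the table, the
# shuffle partitions holding those names will run far longer than the rest.
for name, n in counts.most_common(3):
    print(f"{name}: {n} rows ({100.0 * n / total:.1f}% of the table)")
```

Here two names carry over 99% of the rows, which is the kind of distribution that makes a few reduce tasks crawl while the rest finish instantly.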
Status update after I ran some tests. I modified some other parameters and found two that may be relevant: spark_worker_instance and spark.sql.shuffle.partitions.

Before today I used the default settings of spark_worker_instance and spark.sql.shuffle.partitions, whose values are 1 and 200. At that point my app stopped making progress at 5/200 tasks. I then changed spark_worker_instance to 2, and my app moved on to about 116/200 tasks. Changing spark_worker_instance to 4 got me further, to 176/200. However, when I changed it to 8 or even more, like 12 workers, it was still stuck at 176/200.

Later I noticed something new while trying different values of spark.sql.shuffle.partitions. With 50, 400, or 800 partitions, the job stops at 26/50, 376/400, and 776/800 tasks respectively, always leaving 24 tasks unable to finish. I am not sure why this happens. I hope this info helps solve it.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: <luohui20...@sina.com>
To: "Cheng, Hao" <hao.ch...@intel.com>, "Wang, Daoyuan" <daoyuan.w...@intel.com>, "Olivier Girardot" <ssab...@gmail.com>, "user" <user@spark.apache.org>
Subject: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: May 6, 2015, 09:51

db has 1.7 million records while sample has 0.6 million. For JVM settings, I tried the defaults and also tried to apply 4 GB via "export _java_opts 4g", but the app still stops running. BTW, here is some detailed info about GC and the JVM.

----- Original Message -----
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "luohui20...@sina.com" <luohui20...@sina.com>, "Wang, Daoyuan" <daoyuan.w...@intel.com>, Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: May 5, 2015, 20:50

56 MB / 26 MB is a very small size. Do you observe data skew? More precisely, many records with the same chrname / name? And can you also double-check the JVM settings for the executor process?
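The pattern reported above (24 tasks lagging whether there are 50, 200, 400, or 800 shuffle partitions) is exactly what hash partitioning produces when roughly 24 distinct join keys carry most of the data: all rows with a given key always land in one partition, so raising spark.sql.shuffle.partitions spreads the light keys thinner but can never split a heavy key. A toy sketch of that behavior, with plain Python hashing standing in for Spark's partitioner and 24 invented heavy keys:

```python
def partition_of(key: str, num_partitions: int) -> int:
    # Stand-in for a hash partitioner: every row with the same key
    # maps to the same partition, whatever the partition count is.
    return hash(key) % num_partitions

hot_keys = [f"name{i}" for i in range(24)]  # 24 hypothetical heavy keys

for n in (50, 200, 400, 800):
    hot_partitions = {partition_of(k, n) for k in hot_keys}
    # At most 24 partitions can ever hold the heavy keys, so at most 24
    # tasks stay slow no matter how many partitions are configured.
    print(f"{n} partitions -> {len(hot_partitions)} hot partitions")
```

This matches the observation: increasing parallelism speeds up everything except the fixed set of partitions that own the skewed keys.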
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 5, 2015 7:50 PM
To: Cheng, Hao; Wang, Daoyuan; Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining 2 tables.

Hi guys, attached are the pictures of the physical plan and the logs. Thanks.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "Wang, Daoyuan" <daoyuan.w...@intel.com>, "luohui20...@sina.com" <luohui20...@sina.com>, Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: Re: sparksql running slow while joining 2 tables.
Date: May 5, 2015, 13:18

I assume you're using the DataFrame API within your application:

sql("SELECT ...").explain(true)

From: Wang, Daoyuan
Sent: Tuesday, May 5, 2015 10:16 AM
To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user
Subject: RE: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

You can use EXPLAIN EXTENDED SELECT ....

From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 05, 2015 9:52 AM
To: Cheng, Hao; Olivier Girardot; user
Subject: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

As far as I know, broadcast join is enabled automatically via spark.sql.autoBroadcastJoinThreshold; refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options. Also, how do I check my app's physical plan, and other things like the optimized plan, executable plan, etc.? Thanks.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "Cheng, Hao" <hao.ch...@intel.com>, "luohui20...@sina.com" <luohui20...@sina.com>, Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: May 5, 2015, 08:38

Or, have you ever tried a broadcast join?
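The broadcast join suggested above sidesteps the shuffle entirely: the small table is shipped whole to every executor and joined in place, so a skewed key never piles up in a single reducer. A toy single-process sketch of the idea, with a plain Python dictionary standing in for the broadcast hash table (table and column names follow the thread; the rows are made up):

```python
# Small side (table `sample`): build an in-memory map on the join key, much
# as Spark broadcasts a table under spark.sql.autoBroadcastJoinThreshold.
sample_rows = [("chr1", 100), ("chr1", 300), ("chr2", 50)]  # (name, startpoint)
by_name = {}
for name, b_start in sample_rows:
    by_name.setdefault(name, []).append(b_start)

# Big side (table `db`): streamed row by row, probing the broadcast map,
# then applying the thread's non-equi filter b.startpoint > a.startpoint + 25.
db_rows = [("chr1", 10, 20, "p1"), ("chr2", 40, 60, "p2"), ("chr3", 1, 2, "p3")]
result = []
for name, startpoint, endpoint, piece in db_rows:
    for b_start in by_name.get(name, []):
        if b_start > startpoint + 25:
            result.append((name, startpoint, endpoint, piece))

print(result)
```

Note the db row for "chr1" appears twice in the output because two sample rows match it, which mirrors SQL join semantics: each matching pair produces a row.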
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20...@sina.com; Olivier Girardot; user
Subject: RE: Re: sparksql running slow while joining 2 tables.

Can you print out the physical plan?

EXPLAIN SELECT xxx...

From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Re: sparksql running slow while joining 2 tables.

Hi Olivier,

Spark 1.3.1, with Java 1.8.0_45, and I have attached 2 pictures. It seems like a GC issue. I also tried different parameters, like the memory size of the driver & executor, memory fraction, Java opts... but the issue still happens.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: Olivier Girardot <ssab...@gmail.com>
To: luohui20...@sina.com, user <user@spark.apache.org>
Subject: Re: sparksql running slow while joining 2 tables.
Date: May 4, 2015, 20:46

Hi,
What is your Spark version?
Regards,
Olivier.

On Mon, May 4, 2015 at 11:03, <luohui20...@sina.com> wrote:

Hi guys,

When I run a SQL query like "select a.name, a.startpoint, a.endpoint, a.piece from db a join sample b on (a.name = b.name) where (b.startpoint > a.startpoint + 25);" I find Spark SQL running slowly, taking minutes, which may be caused by very long GC and shuffle times. Table db is created from a txt file of 56 MB, while table sample is 26 MB, both small. My Spark cluster is a standalone pseudo-distributed cluster with an 8 GB executor and a 4 GB driver. Any advice? Thank you guys.

--------------------------------
Thanks & best regards!
罗辉 San.Luo
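One more observation on the original query: it joins only on a.name = b.name and applies the startpoint condition afterwards, so every pair of rows sharing a name must be compared. With 1.7M rows against 0.6M rows, a name that repeats heavily on both sides multiplies into a huge intermediate result, which is consistent with the long GC and shuffle times described above. A rough back-of-the-envelope sketch (the per-name row counts are invented for illustration):

```python
# Hypothetical duplicate counts for each join key on the two sides.
db_rows_for_name = {"chr1": 200_000, "chr2": 1_000}
sample_rows_for_name = {"chr1": 80_000, "chr2": 500}

# An equi-join on name must consider every (db row, sample row) pair sharing
# that name, so the work per key is the product of its counts on both sides.
pairs = {k: db_rows_for_name[k] * sample_rows_for_name.get(k, 0)
         for k in db_rows_for_name}

print(pairs["chr1"])  # 200_000 * 80_000 = 16_000_000_000 pairs for one key
print(pairs["chr2"])  # only 500_000 pairs
```

If a profile like the first key exists in the real data, one reduce task owns billions of comparisons while the others own thousands, which is the skew the earlier replies are probing for.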