It looks like you have data in those 24 partitions, or more. How many unique names are in your data set? Enlarging the shuffle partitions only makes sense if you have many distinct partition groups in your data. What you describe looks like either your dataset only has data in those 24 partitions, or the data in those 24 partitions is skewed. If you are really joining a 56 MB data set with a 26 MB data set, I am surprised that 24 partitions would run very slowly under an 8 GB executor.

Yong

Date: Wed, 6 May 2015 14:04:11 +0800
From: luohui20...@sina.com
To: luohui20...@sina.com; hao.ch...@intel.com; daoyuan.w...@intel.com; ssab...@gmail.com; user@spark.apache.org
Subject: Re: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
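Yong's skew question can be checked before tuning anything else. A minimal sketch of such a check, assuming the join keys can be sampled into plain Python (the `name` column is from the thread; the key values and counts below are invented for illustration):

```python
from collections import Counter

# Hypothetical sample of join keys from the `name` column of table db.
# In practice you would collect a sample of keys from the real table.
keys = ["chr1"] * 50000 + ["chr2"] * 49000 + ["chr3"] * 10 + ["chr4"] * 5

counts = Counter(keys)
total = sum(counts.values())

# Show the heaviest keys: if a handful of names dominate the table, the
# shuffle partitions holding those names will run far longer than the rest.
for name, n in counts.most_common(3):
    print(f"{name}: {n} rows ({100.0 * n / total:.1f}% of the table)")
```

Here two names carry over 99% of the rows, which is the kind of distribution that makes a few reduce tasks crawl while the rest finish instantly.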
Status update after I ran some tests. I modified some other parameters and found two that may be relevant: spark_worker_instance and spark.sql.shuffle.partitions.

Before today I used the default settings of spark_worker_instance and spark.sql.shuffle.partitions, whose values are 1 and 200. At that point my app stopped making progress at 5/200 tasks. I then changed spark_worker_instance to 2, and my app moved on to about 116/200 tasks. Changing spark_worker_instance to 4 got me further, to 176/200. However, when I changed it to 8 or even more, like 12 workers, it was still stuck at 176/200.

Later I noticed something new while trying different values of spark.sql.shuffle.partitions. With 50, 400, or 800 partitions, the job stops at 26/50, 376/400, and 776/800 tasks respectively, always leaving 24 tasks unable to finish. I am not sure why this happens. I hope this info helps solve it.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: <luohui20...@sina.com>
To: "Cheng, Hao" <hao.ch...@intel.com>, "Wang, Daoyuan" <daoyuan.w...@intel.com>, "Olivier Girardot" <ssab...@gmail.com>, "user" <user@spark.apache.org>
Subject: Re: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: May 6, 2015, 09:51

db has 1.7 million records while sample has 0.6 million. For JVM settings, I tried the defaults and also tried to apply 4 GB via "export _java_opts 4g", but the app still stops running. BTW, here is some detailed info about GC and the JVM.

----- Original Message -----
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "luohui20...@sina.com" <luohui20...@sina.com>, "Wang, Daoyuan" <daoyuan.w...@intel.com>, Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: May 5, 2015, 20:50

56 MB / 26 MB is a very small size. Do you observe data skew? More precisely, many records with the same chrname / name? And can you also double-check the JVM settings for the executor process?
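The pattern reported above (24 tasks lagging whether there are 50, 200, 400, or 800 shuffle partitions) is exactly what hash partitioning produces when roughly 24 distinct join keys carry most of the data: all rows with a given key always land in one partition, so raising spark.sql.shuffle.partitions spreads the light keys thinner but can never split a heavy key. A toy sketch of that behavior, with plain Python hashing standing in for Spark's partitioner and 24 invented heavy keys:

```python
def partition_of(key: str, num_partitions: int) -> int:
    # Stand-in for a hash partitioner: every row with the same key
    # maps to the same partition, whatever the partition count is.
    return hash(key) % num_partitions

hot_keys = [f"name{i}" for i in range(24)]  # 24 hypothetical heavy keys

for n in (50, 200, 400, 800):
    hot_partitions = {partition_of(k, n) for k in hot_keys}
    # At most 24 partitions can ever hold the heavy keys, so at most 24
    # tasks stay slow no matter how many partitions are configured.
    print(f"{n} partitions -> {len(hot_partitions)} hot partitions")
```

This matches the observation: increasing parallelism speeds up everything except the fixed set of partitions that own the skewed keys.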
From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 5, 2015 7:50 PM
To: Cheng, Hao; Wang, Daoyuan; Olivier Girardot; user
Subject: Re: Re: sparksql running slow while joining 2 tables.

Hi guys, attached are the pictures of the physical plan and the logs. Thanks.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "Wang, Daoyuan" <daoyuan.w...@intel.com>, "luohui20...@sina.com" <luohui20...@sina.com>, Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: Re: sparksql running slow while joining 2 tables.
Date: May 5, 2015, 13:18

I assume you're using the DataFrame API within your application:

sql("SELECT ...").explain(true)

From: Wang, Daoyuan
Sent: Tuesday, May 5, 2015 10:16 AM
To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user
Subject: RE: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

You can use EXPLAIN EXTENDED SELECT ....

From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Tuesday, May 05, 2015 9:52 AM
To: Cheng, Hao; Olivier Girardot; user
Subject: Re: RE: Re: Re: sparksql running slow while joining 2 tables.

As far as I know, broadcast join is enabled automatically via spark.sql.autoBroadcastJoinThreshold; refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options. Also, how do I check my app's physical plan, and other things like the optimized plan, executable plan, etc.? Thanks.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: "Cheng, Hao" <hao.ch...@intel.com>
To: "Cheng, Hao" <hao.ch...@intel.com>, "luohui20...@sina.com" <luohui20...@sina.com>, Olivier Girardot <ssab...@gmail.com>, user <user@spark.apache.org>
Subject: RE: Re: Re: sparksql running slow while joining 2 tables.
Date: May 5, 2015, 08:38

Or, have you ever tried a broadcast join?
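The broadcast join suggested above sidesteps the shuffle entirely: the small table is shipped whole to every executor and joined in place, so a skewed key never piles up in a single reducer. A toy single-process sketch of the idea, with a plain Python dictionary standing in for the broadcast hash table (table and column names follow the thread; the rows are made up):

```python
# Small side (table `sample`): build an in-memory map on the join key, much
# as Spark broadcasts a table under spark.sql.autoBroadcastJoinThreshold.
sample_rows = [("chr1", 100), ("chr1", 300), ("chr2", 50)]  # (name, startpoint)
by_name = {}
for name, b_start in sample_rows:
    by_name.setdefault(name, []).append(b_start)

# Big side (table `db`): streamed row by row, probing the broadcast map,
# then applying the thread's non-equi filter b.startpoint > a.startpoint + 25.
db_rows = [("chr1", 10, 20, "p1"), ("chr2", 40, 60, "p2"), ("chr3", 1, 2, "p3")]
result = []
for name, startpoint, endpoint, piece in db_rows:
    for b_start in by_name.get(name, []):
        if b_start > startpoint + 25:
            result.append((name, startpoint, endpoint, piece))

print(result)
```

Note the db row for "chr1" appears twice in the output because two sample rows match it, which mirrors SQL join semantics: each matching pair produces a row.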
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Tuesday, May 5, 2015 8:33 AM
To: luohui20...@sina.com; Olivier Girardot; user
Subject: RE: Re: sparksql running slow while joining 2 tables.

Can you print out the physical plan?

EXPLAIN SELECT xxx...

From: luohui20...@sina.com [mailto:luohui20...@sina.com]
Sent: Monday, May 4, 2015 9:08 PM
To: Olivier Girardot; user
Subject: Re: sparksql running slow while joining 2 tables.

Hi Olivier,

Spark 1.3.1, with Java 1.8.0_45, and I have attached 2 pictures. It seems like a GC issue. I also tried different parameters, like the memory size of the driver & executor, memory fraction, Java opts... but the issue still happens.

--------------------------------
Thanks & best regards!
罗辉 San.Luo

----- Original Message -----
From: Olivier Girardot <ssab...@gmail.com>
To: luohui20...@sina.com, user <user@spark.apache.org>
Subject: Re: sparksql running slow while joining 2 tables.
Date: May 4, 2015, 20:46

Hi,
What is your Spark version?
Regards,
Olivier.

On Mon, May 4, 2015 at 11:03, <luohui20...@sina.com> wrote:

Hi guys,

When I run a SQL query like "select a.name, a.startpoint, a.endpoint, a.piece from db a join sample b on (a.name = b.name) where (b.startpoint > a.startpoint + 25);" I find Spark SQL running slowly, taking minutes, which may be caused by very long GC and shuffle times. Table db is created from a txt file of 56 MB, while table sample is 26 MB, both small. My Spark cluster is a standalone pseudo-distributed cluster with an 8 GB executor and a 4 GB driver. Any advice? Thank you guys.

--------------------------------
Thanks & best regards!
罗辉 San.Luo
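One more observation on the original query: it joins only on a.name = b.name and applies the startpoint condition afterwards, so every pair of rows sharing a name must be compared. With 1.7M rows against 0.6M rows, a name that repeats heavily on both sides multiplies into a huge intermediate result, which is consistent with the long GC and shuffle times described above. A rough back-of-the-envelope sketch (the per-name row counts are invented for illustration):

```python
# Hypothetical duplicate counts for each join key on the two sides.
db_rows_for_name = {"chr1": 200_000, "chr2": 1_000}
sample_rows_for_name = {"chr1": 80_000, "chr2": 500}

# An equi-join on name must consider every (db row, sample row) pair sharing
# that name, so the work per key is the product of its counts on both sides.
pairs = {k: db_rows_for_name[k] * sample_rows_for_name.get(k, 0)
         for k in db_rows_for_name}

print(pairs["chr1"])  # 200_000 * 80_000 = 16_000_000_000 pairs for one key
print(pairs["chr2"])  # only 500_000 pairs
```

If a profile like the first key exists in the real data, one reduce task owns billions of comparisons while the others own thousands, which is the skew the earlier replies are probing for.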