RE: Why always spilling to disk and how to improve it?

2015-01-14 Thread Shuai Zheng
Thanks a lot! I just realized that Spark is not really an in-memory version of MapReduce.

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, January 13, 2015 3:53 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Why always spilling to disk and how to improve it

Re: Why always spilling to disk and how to improve it?

2015-01-13 Thread Akhil Das
You could try setting the following to tweak the application a little bit:

.set("spark.rdd.compress", "true")
.set("spark.storage.memoryFraction", "1")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

For shuffle behavior, you can look at this document: https
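A minimal sketch of how these settings fit together in the Spark 1.x `SparkConf` API (the app name is illustrative). One caveat on the suggestion itself: setting `spark.storage.memoryFraction` to 1 gives the entire heap to cached RDD storage and leaves nothing for shuffle aggregation buffers, so a less extreme value (the default is 0.6) is usually safer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the suggested settings, Spark 1.x SparkConf API.
val conf = new SparkConf()
  .setAppName("distinct-count")             // illustrative app name
  .set("spark.rdd.compress", "true")        // compress serialized RDD partitions
  .set("spark.storage.memoryFraction", "1") // caution: starves shuffle buffers
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
```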

Re: Why always spilling to disk and how to improve it?

2015-01-13 Thread Sven Krasser
The distinct call causes a shuffle, which always results in data being written to disk. -Sven

On Tue, Jan 13, 2015 at 12:21 PM, Shuai Zheng wrote:
> Hi All,
>
> I am trying with some small data set. It is only 200m, and what I am doing
> is just do a distinct count on it.
>
> But there are a
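To illustrate Sven's point: `distinct()` repartitions the data by key, and Spark always writes shuffle output to local disk regardless of memory settings. A sketch (the input path is hypothetical), including `countApproxDistinct`, which estimates the cardinality with HyperLogLog and can sidestep shuffling the full data set when an approximate count is acceptable:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("distinct-demo"))
val rdd = sc.textFile("hdfs:///path/to/input") // hypothetical input path

// distinct() triggers a full shuffle; shuffle output always hits local disk.
val exact = rdd.distinct().count()

// Approximate alternative: per-partition HyperLogLog sketches are merged on
// the driver, so the full data set is never shuffled.
val approx = rdd.countApproxDistinct(relativeSD = 0.01)
```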