Thanks a lot!
I just realized that Spark is not really an in-memory version of MapReduce :)
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, January 13, 2015 3:53 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Why always spilling to disk and how to improve it
You could try setting the following to tweak the application a little bit:
.set("spark.rdd.compress","true")
.set("spark.storage.memoryFraction", "1")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
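For context, these settings would typically be applied when constructing the SparkConf at application start-up. A minimal sketch, assuming Spark's Scala API (the app name is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: applying the suggested settings when the application starts.
// Note: spark.storage.memoryFraction = 1 gives the entire executor heap
// to cached RDDs, which can starve shuffle buffers -- treat it as a
// tuning knob to experiment with, not a recommended production value.
val conf = new SparkConf()
  .setAppName("DistinctCount")  // hypothetical app name
  .set("spark.rdd.compress", "true")
  .set("spark.storage.memoryFraction", "1")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
```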
For shuffle behavior, you can look at this document:
https
The distinct call causes a shuffle, which always results in data being
written to disk.
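To see why the shuffle is unavoidable here: in the Spark codebase, RDD.distinct is built on reduceByKey, which repartitions records by key, and shuffle map output is always written to local disk regardless of memory settings. A rough sketch of what distinct-count expands to, assuming an existing SparkContext `sc` and a hypothetical input path:

```scala
// distinct() is implemented roughly as:
//   map(x => (x, null)).reduceByKey((a, _) => a).map(_._1)
val data = sc.textFile("hdfs:///path/to/input")  // hypothetical path
val distinctCount = data
  .map(line => (line, null))
  .reduceByKey((a, _) => a)  // shuffle boundary: shuffle files hit disk here
  .map(_._1)
  .count()
```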
-Sven
On Tue, Jan 13, 2015 at 12:21 PM, Shuai Zheng wrote:
> Hi All,
>
>
>
> I am trying with a small data set. It is only 200 MB, and all I am
> doing is a distinct count on it.
>
> But there are a