Have you tried the following method?

 * Note: With shuffle = true, you can actually coalesce to a larger number
 * of partitions. This is useful if you have a small number of partitions,
 * say 100, potentially with a few partitions being abnormally large. Calling
 * coalesce(1000, shuffle = true) will result in 1000 partitions with the
 * data distributed using a hash partitioner.
 */
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
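A minimal sketch of the idea (assuming a local SparkContext; the object and variable names here are illustrative, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)

    // A single-partition RDD: every operation on it runs as one task.
    val rdd = sc.parallelize(1 to 1000, numSlices = 1)
    println(rdd.getNumPartitions) // 1

    // With shuffle = true, coalesce can INCREASE the partition count,
    // redistributing the data with a hash partitioner.
    val wider = rdd.coalesce(8, shuffle = true)
    println(wider.getNumPartitions) // 8

    // repartition(n) is equivalent to coalesce(n, shuffle = true).
    val same = rdd.repartition(8)
    println(same.getNumPartitions) // 8

    sc.stop()
  }
}
```

Note that the shuffle has a cost, but it is what restores parallelism once the data is spread across partitions again.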
Cheers

On Mon, Dec 21, 2015 at 2:47 AM, Zhiliang Zhu <zchl.j...@yahoo.com.invalid> wrote:
> Dear All,
>
> For some RDD, if there is just one partition, then every operation and
> computation on it runs as a single task, and the RDD loses all the
> parallelism benefit of the Spark system ...
>
> Is it exactly like that?
>
> Thanks very much in advance!
> Zhiliang