My use case is as follows.

1. Read the input data from the local file system using sparkContext.textFile(input
path).
2. Partition the input data (80 million records) using
RDD.coalesce(numberOfPartitions) before handing it to the mapper/reducer
functions. Without coalesce() or repartition() on the input data, Spark runs
very slowly and fails with an out-of-memory exception. (A rough sketch of this
flow is pasted right after this list.)
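
Roughly what the job looks like (the input/output paths, the partition count,
and the word-count-style map/reduce below are placeholders for my real logic):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("partition-test")
val sc   = new SparkContext(conf)

// ~80 million records read from the local file system
val input = sc.textFile("file:///data/input")

// today this number comes from trial and error
val numberOfPartitions = 200
// repartition(n) is coalesce(n, shuffle = true); plain coalesce(n) can only reduce the count
val partitioned = input.repartition(numberOfPartitions)

partitioned
  .map(line => (line, 1))
  .reduceByKey(_ + _)
  .saveAsTextFile("file:///data/output")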

The issue I am facing is deciding the number of partitions to apply to the
input data. The input data size varies every run, so hard-coding a particular
value is not an option. Spark only performs well when a near-optimal partition
count is used, and finding it currently takes many iterations of trial and
error, which is not workable in a production environment.

My question: is there a rule of thumb for deciding the number of partitions
based on the input data size and the cluster resources available (executors,
cores, etc.)? If yes, please point me in that direction. Any help is much
appreciated.
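
To make the question concrete, the kind of rule I am hoping exists would look
something like the sketch below; the 128 MB target partition size and the
3-tasks-per-core floor are just numbers I made up for illustration, not values
I have validated:

// Illustration only: guess a partition count from input size and cluster resources.
def guessNumPartitions(inputBytes: Long,
                       numExecutors: Int,
                       coresPerExecutor: Int): Int = {
  val targetPartitionBytes = 128L * 1024 * 1024                 // aim for ~128 MB per partition
  val bySize  = math.max((inputBytes / targetPartitionBytes).toInt, 1)
  val byCores = numExecutors * coresPerExecutor * 3             // keep every core busy a few times over
  math.max(bySize, byCores)
}

// e.g. 20 GB of input on 10 executors with 4 cores each => 160 partitions
val n = guessNumPartitions(20L * 1024L * 1024L * 1024L, 10, 4)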

I am using Spark 1.0 on YARN.

Thanks,
AG


