After reading some parts of the Spark source code I would like to ask some questions about RDD execution and scheduling.
First, please correct me if I am wrong about the following:

1) The number of partitions equals the number of tasks that will be executed in parallel (e.g., when an RDD is repartitioned into 30 partitions, a count aggregate will be executed as 30 tasks distributed across the cluster).
2) A task <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/Task.scala> concerns only one partition (partitionId: Int), and this partition maps to an RDD block.
3) If an RDD is cached, then the preferred location for executing a partition (and the corresponding RDD block) is the node where the data is cached.

The questions are the following. I ran some SQL aggregate functions on a TPCH dataset. The cluster consists of 7 executors (and one driver), each with 8 GB of RAM and 4 VCPUs. The dataset is in Parquet format in an external Hadoop cluster, that is, the Spark workers and the Hadoop DataNodes run on different VMs.

1) For a count aggregate, I repartitioned the DataFrame into 24 partitions and each executor took 2 partitions (tasks) for execution. Does it always happen this way (i.e., is the number of tasks per node equal to #Partitions/#Workers)?
2) How does Spark choose the executor for each task if the data is not cached? It is clear what happens if the data is cached in DAGScheduler.scala <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1541>, but what if it is not? Is it possible to determine that before execution (see the sketch at the end of this mail)?
3) In the case of an SQL join operation, is it possible to determine how many tasks/partitions will be generated and to which worker each task will be submitted?
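For reference, here is a minimal sketch of what I mean by "inspecting before execution". It assumes a Spark 2.x SparkSession bound to `spark`; `tpchPath` is just a placeholder for the Parquet location, and the preferred-locations printout is my assumption about where locality information would surface:

    // Minimal sketch, assuming a Spark 2.x SparkSession named `spark`.
    val tpchPath = "hdfs:///path/to/tpch/lineitem"   // placeholder path
    val df = spark.read.parquet(tpchPath).repartition(24)
    val rdd = df.rdd

    // (1) One task per partition: a count over 24 partitions runs as 24 tasks.
    println(s"partitions = ${rdd.getNumPartitions}")

    // (2) Preferred locations per partition; typically empty unless the RDD is
    // cached or the underlying source reports block locality to the scheduler.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
    }

    // (3) For SQL joins/aggregations, the number of post-shuffle partitions
    // (and hence tasks) is governed by spark.sql.shuffle.partitions (default 200).
    println(spark.conf.get("spark.sql.shuffle.partitions"))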