After reading some parts of the Spark source code I would like to ask some questions about RDD execution and scheduling.
First, please correct me if I am wrong about the following:

1) The number of partitions equals the number of tasks that will be executed in parallel (e.g., when an RDD is repartitioned into 30 partitions, a count aggregate will be executed as 30 tasks distributed across the cluster).
2) A task <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/Task.scala> concerns only one partition (partitionId: Int), and this partition maps to an RDD block.
3) If an RDD is cached, then the preferred location for executing a partition (and the corresponding RDD block) is the node where the data is cached.

The questions are the following. I ran some SQL aggregate functions on a TPCH dataset. The cluster consists of 7 executors (and one driver), each with 8 GB of RAM and 4 VCPUs. The dataset is in Parquet format in an external Hadoop cluster, that is, the Spark workers and the Hadoop DataNodes run on different VMs.

1) For a count aggregate, I repartitioned the DataFrame into 24 partitions and each executor took 2 partitions (tasks) for execution. Does it always happen this way (i.e., is the number of tasks per node equal to #Partitions/#Workers)?
2) How does Spark choose the executor for each task if the data is not cached? It is clear what happens if the data is cached in DAGScheduler.scala <https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1541>, but what if it is not? Is it possible to determine that before execution (see the sketch at the end of this mail)?
3) In the case of an SQL join operation, is it possible to determine how many tasks/partitions will be generated and to which worker each task will be submitted?
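For reference, here is a minimal sketch of what I mean by "inspecting before execution". It assumes a Spark 2.x SparkSession bound to `spark`; `tpchPath` is just a placeholder for the Parquet location, and the preferred-locations printout is my assumption about where locality information would surface:

    // Minimal sketch, assuming a Spark 2.x SparkSession named `spark`.
    val tpchPath = "hdfs:///path/to/tpch/lineitem"   // placeholder path
    val df = spark.read.parquet(tpchPath).repartition(24)
    val rdd = df.rdd

    // (1) One task per partition: a count over 24 partitions runs as 24 tasks.
    println(s"partitions = ${rdd.getNumPartitions}")

    // (2) Preferred locations per partition; typically empty unless the RDD is
    // cached or the underlying source reports block locality to the scheduler.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
    }

    // (3) For SQL joins/aggregations, the number of post-shuffle partitions
    // (and hence tasks) is governed by spark.sql.shuffle.partitions (default 200).
    println(spark.conf.get("spark.sql.shuffle.partitions"))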