Can your database receive the writes concurrently? I.e., do you make sure that each executor writes into a different partition on the database side?
> On 25. May 2018, at 16:42, Yong Zhang <java8...@hotmail.com> wrote:
>
> Spark version 2.2.0
>
> We are trying to write a DataFrame to a remote relational database (AWS Redshift). Based on the Spark JDBC documentation, we repartition our DataFrame into 12 partitions and set the Spark JDBC "numPartitions" parameter to 12 for concurrent writing.
>
> We run the command as follows:
>
> dataframe.repartition(12).write.mode("overwrite").option("batchsize", 5000).option("numPartitions", 12).jdbc(url=jdbcurl, table="tableName", connectionProperties=connectionProps)
>
> Here is the Spark UI:
>
> <Screen Shot 2018-05-25 at 10.21.50 AM.png>
>
> We found that the 12 tasks are clearly running in sequential order. They are all in "RUNNING" status at the same time in the beginning, but if we check their "Duration" and "Shuffle Read Size/Records", it is clear that they run one by one.
> For example, task 8 finished first in about 2 hours and wrote 34732 records to the remote DB (I know the speed looks terrible, but that's not the question of this post), and task 0 started after task 8 and took 4 hours (the first 2 hours spent waiting for task 8).
> In this picture, only tasks 2 and 4 are in the running stage, but task 4 is obviously waiting for task 2 to finish before it starts writing.
>
> My question is: in the above Spark command, my understanding is that the 12 executors should open JDBC connections to the remote DB concurrently, all 12 tasks should start writing concurrently, and the whole job should finish in around 2 hours overall.
>
> Why are the 12 tasks in the "RUNNING" stage but apparently waiting for something, and only able to write to the remote DB sequentially? The 12 executors are in different JVMs on different physical nodes. Why is this happening? What stops Spark from pushing the data truly concurrently?
>
> Thanks
>
> Yong
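The reply's question about database-side concurrency can be sketched with a toy simulation (plain Python, not Spark; the lock, sleep duration, and task count are illustrative assumptions). If the target table effectively takes an exclusive write lock per batch, 12 writer tasks all appear "running" at once but complete one by one, and the wall-clock time approaches 12x the per-task time instead of 1x:

```python
# Toy model (NOT Spark code): N writer tasks contending for a single
# table-level write lock on the database side. All threads start together
# (they all look "RUNNING"), but the lock forces them to finish serially.
import threading
import time

table_lock = threading.Lock()  # stand-in for an exclusive table-level write lock
completion_order = []

def write_partition(task_id, serialized):
    if serialized:
        with table_lock:       # each writer must wait its turn for the lock
            time.sleep(0.05)   # stand-in for the batch INSERT
            completion_order.append(task_id)
    else:
        time.sleep(0.05)       # no contention: writers proceed in parallel
        completion_order.append(task_id)

def run(n_tasks, serialized):
    completion_order.clear()
    start = time.monotonic()
    threads = [threading.Thread(target=write_partition, args=(i, serialized))
               for i in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

locked_time = run(12, serialized=True)    # finishes one by one, ~12 x 0.05s
free_time = run(12, serialized=False)     # truly concurrent, ~1 x 0.05s
print(f"serialized: {locked_time:.2f}s, concurrent: {free_time:.2f}s")
```

This matches the symptom described above (tasks all "RUNNING" yet finishing strictly one after another), so it is worth checking whether the database serializes concurrent writes to the same table before looking for a cause on the Spark side.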