Can your database receive the writes concurrently? I.e., do you make sure that each executor writes into a different partition on the database side?
> On 25. May 2018, at 16:42, Yong Zhang <java8...@hotmail.com> wrote:
>
> Spark version 2.2.0
>
> We are trying to write a DataFrame to a remote relational database (AWS Redshift). Based on the Spark JDBC documentation, we repartition our DataFrame into 12 partitions and set the Spark JDBC "numPartitions" parameter to 12 for concurrent writing.
>
> We run the command as follows:
>
> dataframe.repartition(12).write.mode("overwrite").option("batchsize", 5000).option("numPartitions", 12).jdbc(url=jdbcurl, table="tableName", connectionProperties=connectionProps)
>
> Here is the Spark UI:
>
> <Screen Shot 2018-05-25 at 10.21.50 AM.png>
>
> We found that the 12 tasks are clearly running in sequential order. They are all in "RUNNING" status at the same time in the beginning, but if we check their "Duration" and "Shuffle Read Size/Records", it is clear that they run one by one.
> For example, task 8 finished first in about 2 hours and wrote 34732 records to the remote DB (I know the speed looks terrible, but that's not the question of this post), and task 0 started after task 8 and took 4 hours (the first 2 hours spent waiting for task 8).
> In this picture, only tasks 2 and 4 are in the running stage, but task 4 is obviously waiting for task 2 to finish before it starts writing.
>
> My question is: in the above Spark command, my understanding is that the 12 executors should open JDBC connections to the remote DB concurrently, all 12 tasks should start writing concurrently, and the whole job should finish in around 2 hours overall.
>
> Why are the 12 tasks in the "RUNNING" stage but apparently waiting for something, and only able to write to the remote DB sequentially? The 12 executors are in different JVMs on different physical nodes. Why is this happening? What stops Spark from pushing the data truly concurrently?
>
> Thanks
>
> Yong
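The reply's question about database-side concurrency can be sketched with a toy simulation (plain Python, not Spark; the lock, sleep duration, and task count are illustrative assumptions). If the target table effectively takes an exclusive write lock per batch, 12 writer tasks all appear "running" at once but complete one by one, and the wall-clock time approaches 12x the per-task time instead of 1x:

```python
# Toy model (NOT Spark code): N writer tasks contending for a single
# table-level write lock on the database side. All threads start together
# (they all look "RUNNING"), but the lock forces them to finish serially.
import threading
import time

table_lock = threading.Lock()  # stand-in for an exclusive table-level write lock
completion_order = []

def write_partition(task_id, serialized):
    if serialized:
        with table_lock:       # each writer must wait its turn for the lock
            time.sleep(0.05)   # stand-in for the batch INSERT
            completion_order.append(task_id)
    else:
        time.sleep(0.05)       # no contention: writers proceed in parallel
        completion_order.append(task_id)

def run(n_tasks, serialized):
    completion_order.clear()
    start = time.monotonic()
    threads = [threading.Thread(target=write_partition, args=(i, serialized))
               for i in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

locked_time = run(12, serialized=True)    # finishes one by one, ~12 x 0.05s
free_time = run(12, serialized=False)     # truly concurrent, ~1 x 0.05s
print(f"serialized: {locked_time:.2f}s, concurrent: {free_time:.2f}s")
```

This matches the symptom described above (tasks all "RUNNING" yet finishing strictly one after another), so it is worth checking whether the database serializes concurrent writes to the same table before looking for a cause on the Spark side.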