Hi,

Say I have a table with one column and 1,000 rows, and I want to save a query result into an RDBMS table using the JDBC relation provider. So I run the following query:

    insert into table table2
    select value, count(*) from table1 group by value order by value

While debugging, I found that the resulting DataFrame from "select value, count(*) from table1 group by value order by value" has 200+ partitions, and say I have 4 executors attached to my driver. So I end up with 200+ write tasks distributed over those 4 executors.

I want to understand how these executors are able to write the data to the underlying RDBMS table (table2) without messing up the order of the results. I checked the JDBC insertable relation, and in JdbcUtils [1] it does the following:

    df.foreachPartition { iterator =>
      savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect)
    }

So my understanding is that all 4 of my executors will run the savePartition function (closure) in parallel, with no way of knowing which of them should write its data before the others! The comment on the savePartition method says: "Saves a partition of a DataFrame to the JDBC database. This is done in a single database transaction in order to avoid repeatedly inserting data as much as possible."
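Reading that code, here is roughly what I understand savePartition to be doing for each partition. This is only my simplified sketch of the 1.6.2 code path, not the actual implementation: the helper names makeConnection and insertSql are mine, and I have dropped the dialect/null-type handling and error recovery.

    import java.sql.{Connection, PreparedStatement}

    // Sketch of my reading of JdbcUtils.savePartition (Spark 1.6.2).
    // Runs once per partition, on whichever executor holds that partition.
    def savePartitionSketch(
        makeConnection: () => Connection,   // stand-in for getConnection
        insertSql: String,                  // generated "INSERT INTO ... VALUES (?, ...)"
        rows: Iterator[Seq[AnyRef]],
        batchSize: Int): Unit = {
      val conn = makeConnection()
      try {
        conn.setAutoCommit(false)           // one transaction per partition
        val stmt: PreparedStatement = conn.prepareStatement(insertSql)
        try {
          var count = 0
          while (rows.hasNext) {
            val row = rows.next()
            var i = 0
            while (i < row.length) {
              stmt.setObject(i + 1, row(i)) // the real code dispatches on the column type
              i += 1
            }
            stmt.addBatch()
            count += 1
            if (count % batchSize == 0) {
              stmt.executeBatch()           // flush a batch, still inside the transaction
            }
          }
          if (count % batchSize != 0) {
            stmt.executeBatch()
          }
        } finally {
          stmt.close()
        }
        conn.commit()                       // the commit covers the whole partition at once
      } finally {
        conn.close()
      }
    }

If my sketch is right, each partition gets its own connection and its own transaction, but I do not see anything here that coordinates the order in which the partitions commit.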
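For completeness, this is roughly the driver-side code for my scenario (a toy version; the JDBC URL below is made up, and sqlContext is the usual SQLContext on the driver):

    import java.util.Properties
    import org.apache.spark.sql.SaveMode

    val df = sqlContext.sql(
      "select value, count(*) from table1 group by value order by value")

    // this is where I see the 200+ partitions; I believe it is the default
    // spark.sql.shuffle.partitions (200) kicking in after the shuffle
    println(df.rdd.partitions.length)

    // appends via the JDBC relation provider, i.e. the foreachPartition path above
    df.write.mode(SaveMode.Append)
      .jdbc("jdbc:mysql://localhost/test", "table2", new Properties())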
"insert into table table2 select value, count(*) from table1 group by value order by value" While debugging, I found that the resultant df from select value, count(*) from table1 group by value order by value would have around 200+ partitions and say I have 4 executors attached to my driver. So, I would have 200+ writing tasks assigned to 4 executors. I want to understand, how these executors are able to write the data to the underlying RDBMS table of table2 without messing up the order. I checked the jdbc insertable relation and in jdbcUtils [1] it does the following df.foreachPartition { iterator => savePartition(getConnection, table, iterator, rddSchema, nullTypes, batchSize, dialect) } So, my understanding is, all of my 4 executors will parallely run the savePartition function (or closure) where they do not know which one should write data before the other! In the savePartition method, in the comment, it says "Saves a partition of a DataFrame to the JDBC database. This is done in * a single database transaction in order to avoid repeatedly inserting * data as much as possible." I want to understand, how these parallel executors save the partition without harming the order of the results? Is it by locking the database resource, from each executor (i.e. ex0 would first obtain a lock for the table and write the partition0, while ex1 ... ex3 would wait till the lock is released )? In my experience, there is no harm done to the order of the results at the end of the day! Would like to hear from you guys! :-) [1] https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277 -- Niranda Perera @n1r44 <https://twitter.com/N1R44> +94 71 554 8430 https://www.linkedin.com/in/niranda https://pythagoreanscript.wordpress.com/