Hi,

Say I have a table with 1 column and 1000 rows, and I want to save the result
in an RDBMS table using the JDBC relation provider. So I run the following
query,

"insert into table table2 select value, count(*) from table1 group by value
order by value"

While debugging, I found that the resultant DataFrame from "select value,
count(*) from table1 group by value order by value" would have 200+
partitions, and say I have 4 executors attached to my driver. So I would have
200+ write tasks assigned to 4 executors. I want to understand how these
executors are able to write the data to the underlying RDBMS table of
table2 without messing up the order.

I checked the JDBC insertable relation, and in JdbcUtils [1] it does the
following:

df.foreachPartition { iterator =>
  savePartition(getConnection, table, iterator, rddSchema, nullTypes,
    batchSize, dialect)
}

So, my understanding is that all 4 of my executors will run the
savePartition function (or closure) in parallel, with no coordination about
which one should write its data before the others!
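As a toy sketch of what I mean (plain Scala futures standing in for executors here; the pool, the 8 fake partitions, and the sleep are my invention, not Spark's actual scheduling):

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.jdk.CollectionConverters._

// Stand-ins for 8 partitions written by a shared thread pool.
// The completion order recorded below is nondeterministic,
// but no partition's contents are lost.
val completionOrder = new ConcurrentLinkedQueue[Int]()

val writes = (0 until 8).map { partitionId =>
  Future {
    Thread.sleep(scala.util.Random.nextInt(20)) // random write latency
    completionOrder.add(partitionId)
  }
}

Await.result(Future.sequence(writes), 5.seconds)

// Every partition gets written exactly once...
assert(completionOrder.asScala.toSet == (0 until 8).toSet)
// ...but the order they finish in need not be 0, 1, 2, ...
println(completionOrder.asScala.mkString(", "))
```

That is, the set of writes is complete, but nothing in the scheduling itself imposes an order between partitions.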

In the savePartition method, the comment says:
"Saves a partition of a DataFrame to the JDBC database. This is done in
a single database transaction in order to avoid repeatedly inserting
data as much as possible."

I want to understand how these parallel executors save their partitions
without harming the order of the results. Is it by locking the table from
each executor (i.e. ex0 would first obtain a lock on the table and write
partition0, while ex1 ... ex3 would wait until the lock is released)?
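For context, here is my mental model of what each task does, as a minimal sketch. FakeTable and its methods are my own stand-in for the JDBC connection, not Spark's code: each batch insert is atomic, but nothing serializes whole partitions against each other.

```scala
import scala.collection.mutable.ArrayBuffer

// A fake table: each insertBatch is atomic (like one statement within
// a transaction), but there is no table-level lock across partitions.
class FakeTable {
  private val rows = ArrayBuffer[(String, Long)]()
  def insertBatch(batch: Seq[(String, Long)]): Unit =
    synchronized { rows ++= batch }
  def contents: Seq[(String, Long)] = synchronized(rows.toList)
}

// Roughly the per-task pattern: iterate the partition, buffer rows
// into batches, and flush each batch in one round trip.
def savePartitionSketch(table: FakeTable,
                        partition: Iterator[(String, Long)],
                        batchSize: Int): Unit =
  partition.grouped(batchSize).foreach(b => table.insertBatch(b.toSeq))

val table = new FakeTable
savePartitionSketch(table, Iterator(("a", 3L), ("b", 1L)), batchSize = 1)
savePartitionSketch(table, Iterator(("c", 2L)), batchSize = 1)
assert(table.contents.toSet == Set(("a", 3L), ("b", 1L), ("c", 2L)))
```

If two such tasks ran concurrently, the batches from different partitions could interleave in the table; only the rows within a single partition keep their iterator order.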

In my experience, the order of the results is not harmed at the end of the
day!

Would like to hear from you guys! :-)

[1]
https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277

-- 
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+94 71 554 8430
https://www.linkedin.com/in/niranda
https://pythagoreanscript.wordpress.com/
