Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

Maciej Szymkiewicz Tue, 22 Nov 2016 03:54:08 -0800

On 11/22/2016 12:11 PM, nirandap wrote:

> Hi Maciej, 
>
> Thank you for your reply. 
>
> I have 2 queries.
> 1. I can understand your explanation. But in my experience, when I
> check the final RDBMS table, I see that the results follow the
> expected order, without an issue. Is this just a coincidence?
Not exactly a coincidence. This is typically a result of a physical
location on the disk. If writes and reads are sequential, (this is
usually the case) you'll see things in the expected order, but you have
to remember that location on disk is not stable. For example if you
perform some updates, deletes and VACUM ALL (PostgreSQL) physical
location on disk will change and with it things you see.


There of course more advanced mechanisms out there. For example modern
columnar RDBMS like HANA use techniques like dimensions sorting and
differential stores so even the initial order may differ. There probably
some other solutions which choose different strategies (maybe some times
series oriented projects?) I am not aware of.

>
> 2. I was further looking into this. So, say I run this query
> "select value, count(*) from table1 group by value order by value"
>
> and I call df.collect() in the resultant dataframe. From my
> experience, I see that the given values follow the expected order. May
> I know how spark manages to retain the order of the results in a
> collect operation?
Once you execute ordered operation each partition is sorted and the
order of partitions defines the global ordering. All what collect does
is just preserving this order by creating an array of results for each
partition and flattening it.
>
> Best 
>
>
> On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark
> Developers List] <[hidden email]
> </user/SendEmail.jtp?type=node&node=19985&i=0>> wrote:
>
>     In commonly used RDBM systems relations have no fixed order and
>     physical location of the records can change during routine
>     maintenance operations. Unless you explicitly order data during
>     retrieval order you see is incidental and not guaranteed. 
>
>     Conclusion: order of inserts just doesn't matter.
>
>     On 11/21/2016 10:03 AM, Niranda Perera wrote:
>>     Hi, 
>>
>>     Say, I have a table with 1 column and 1000 rows. I want to save
>>     the result in a RDBMS table using the jdbc relation provider. So
>>     I run the following query, 
>>
>>     "insert into table table2 select value, count(*) from table1
>>     group by value order by value"
>>
>>     While debugging, I found that the resultant df from select value,
>>     count(*) from table1 group by value order by value would have
>>     around 200+ partitions and say I have 4 executors attached to my
>>     driver. So, I would have 200+ writing tasks assigned to 4
>>     executors. I want to understand, how these executors are able to
>>     write the data to the underlying RDBMS table of table2 without
>>     messing up the order. 
>>
>>     I checked the jdbc insertable relation and in jdbcUtils [1] it
>>     does the following
>>
>>     df.foreachPartition { iterator =>
>>           savePartition(getConnection, table, iterator, rddSchema,
>>     nullTypes, batchSize, dialect)
>>         }
>>
>>     So, my understanding is, all of my 4 executors will parallely run
>>     the savePartition function (or closure) where they do not know
>>     which one should write data before the other! 
>>
>>     In the savePartition method, in the comment, it says 
>>     "Saves a partition of a DataFrame to the JDBC database.  This is
>>     done in
>>        * a single database transaction in order to avoid repeatedly
>>     inserting
>>        * data as much as possible."
>>
>>     I want to understand, how these parallel executors save the
>>     partition without harming the order of the results? Is it by
>>     locking the database resource, from each executor (i.e. ex0 would
>>     first obtain a lock for the table and write the partition0, while
>>     ex1 ... ex3 would wait till the lock is released )? 
>>
>>     In my experience, there is no harm done to the order of the
>>     results at the end of the day! 
>>
>>     Would like to hear from you guys! :-) 
>>
>>     [1] 
>> https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277
>>     
>> <https://github.com/apache/spark/blob/v1.6.2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L277>
>>
>>     -- 
>>     Niranda Perera
>>     @n1r44 <https://twitter.com/N1R44>
>>     <a href="tel:%2B94%2071%20554%208430" value="+94715548430"
>>     target="_blank">+94 71 554 8430
>>     https://www.linkedin.com/in/niranda
>>     <https://www.linkedin.com/in/niranda>
>>     https://pythagoreanscript.wordpress.com/
>>     <https://pythagoreanscript.wordpress.com/>
>
>     -- 
>     Best regards,
>     Maciej Szymkiewicz
>
>
>
>     ------------------------------------------------------------------------
>     If you reply to this email, your message will be added to the
>     discussion below:
>     
> http://apache-spark-developers-list.1001551.n3.nabble.com/How-is-the-order-ensured-in-the-jdbc-relation-provider-when-inserting-data-from-multiple-executors-tp19970p19971.html
>     
> <http://apache-spark-developers-list.1001551.n3.nabble.com/How-is-the-order-ensured-in-the-jdbc-relation-provider-when-inserting-data-from-multiple-executors-tp19970p19971.html>
>
>     To start a new topic under Apache Spark Developers List, email
>     [hidden email] </user/SendEmail.jtp?type=node&node=19985&i=1>
>     To unsubscribe from Apache Spark Developers List, click here.
>     NAML
>     
> <http://apache-spark-developers-list.1001551.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>
>
>
>
>
> -- 
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +94 71 554 8430
> https://www.linkedin.com/in/niranda
> https://pythagoreanscript.wordpress.com/
>
> ------------------------------------------------------------------------
> View this message in context: Re: How is the order ensured in the jdbc
> relation provider when inserting data from multiple executors
> <http://apache-spark-developers-list.1001551.n3.nabble.com/How-is-the-order-ensured-in-the-jdbc-relation-provider-when-inserting-data-from-multiple-executors-tp19970p19985.html>
> Sent from the Apache Spark Developers List mailing list archive
> <http://apache-spark-developers-list.1001551.n3.nabble.com/> at
> Nabble.com.

-- 
Maciej Szymkiewicz

Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

Reply via email to