[ 
https://issues.apache.org/jira/browse/SPARK-27716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27716:
----------------------------
    Description: 
With the jdbc datasource, we can save an RDD to the database.

The comment on the function saveTable claims:

  /**
   * Saves the RDD to the database in a single transaction.
   */
  def saveTable(
      df: DataFrame,
      tableSchema: Option[StructType],
      isCaseSensitive: Boolean,
      options: JdbcOptionsInWrite)
In fact, this is not true.

The savePartition operation runs in a single transaction, but the saveTable 
operation does not.

There are several cases of data transmission:

case1: Append data to the existing gptable.
case2: Overwrite the existing gptable, but the table is a cascadingTruncateTable, 
so we can not drop the gptable; we have to truncate it and then append the data.
case3: Overwrite an existing table that is not a cascadingTruncateTable, so we 
can drop it first.
case4: The table does not exist yet: create it, then transmit the data.
In this PR, I add transaction support for case3 and case4.

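The four cases above can be sketched as a pure dispatch function. This is an illustrative model only, not the actual Spark internals; the names (SaveAction, classify, etc.) are hypothetical:

```scala
// Hypothetical model of the four write cases described above.
sealed trait SaveAction
case object AppendOnly          extends SaveAction // case1
case object TruncateAndAppend   extends SaveAction // case2
case object DropCreateAndAppend extends SaveAction // case3
case object CreateAndAppend     extends SaveAction // case4

// Classify a write by whether the table exists, whether we overwrite,
// and whether the table is a cascadingTruncateTable (so it cannot be dropped).
def classify(tableExists: Boolean,
             overwrite: Boolean,
             cascadingTruncate: Boolean): SaveAction =
  (tableExists, overwrite) match {
    case (true, false)                         => AppendOnly
    case (true, true) if cascadingTruncate     => TruncateAndAppend
    case (true, true)                          => DropCreateAndAppend
    case (false, _)                            => CreateAndAppend
  }
```

Only the last two branches (case3 and case4) end with a freshly created table, which is why the staging-table approach below applies to them.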
For case3 and case4, we can first transmit the RDD to a temp table.

We use an accumulator to record the successful savePartition operations.

Finally, we compare the accumulator's value with the DataFrame's number of 
partitions.

If all the savePartition operations succeeded, we drop the original table if 
it exists, then rename the temp table to the original table.

> Complete the transactions support for part of jdbc datasource operations.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-27716
>                 URL: https://issues.apache.org/jira/browse/SPARK-27716
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: feiwang
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
