[GitHub] [spark] turboFei commented on a change in pull request #25863: [SPARK-28945][SPARK-29037][CORE][SQL] Fix the issue that spark gives duplicate result and support concurrent file source write operations write to different partitions in the same table.

GitBox Wed, 25 Sep 2019 11:23:31 -0700

turboFei commented on a change in pull request #25863: 
[SPARK-28945][SPARK-29037][CORE][SQL] Fix the issue that spark gives duplicate 
result and support concurrent file source write operations write to different 
partitions in the same table.
URL: https://github.com/apache/spark/pull/25863#discussion_r328269028


 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/sources/PartitionedWriteSuite.scala
 ##########
 @@ -156,4 +189,66 @@ class PartitionedWriteSuite extends QueryTest with SharedSparkSession {
       }
     }
   }
+
+  test("Output path should be a staging output dir, whose last level path name is jobId," +
+    " when dynamicPartitionOverwrite is enabled") {
+    withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> PartitionOverwriteMode.DYNAMIC.toString) {
+      withTable("t") {
+        withSQLConf(SQLConf.FILE_COMMIT_PROTOCOL_CLASS.key ->
+          classOf[DetectCorrectOutputPathFileCommitProtocol].getName) {
+          Seq((1, 2)).toDF("a", "b")
+            .write
+            .partitionBy("b")
+            .mode("overwrite")
+            .saveAsTable("t")
+        }
+      }
+    }
+  }
+
+  test("Concurrent write to the same table with different partitions should be possible") {
+    withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> PartitionOverwriteMode.DYNAMIC.toString) {
+      withTable("t") {
+        val sem = new Semaphore(0)
+        Seq((1, 2)).toDF("a", "b")
+          .write
+          .partitionBy("b")
+          .mode("overwrite")
+          .saveAsTable("t")
+
+        val df1 = spark.range(0, 10).map(x => (x, 1)).toDF("a", "b")
+        val df2 = spark.range(0, 10).map(x => (x, 2)).toDF("a", "b")
+        val dfs = Seq(df1, df2)
+
+        var throwable: Option[Throwable] = None
+        for (i <- 0 until 2) {
+          new Thread {
+            override def run(): Unit = {
+              try {
+                dfs(i)
+                  .write
+                  .mode("overwrite")
+                  .insertInto("t")
+              } catch {
+                case t: Throwable =>
+                  throwable = Some(t)
+              } finally {
+                sem.release()
+              }
+            }
+          }.start()
+        }
+        // make sure writing table in two threads are executed.
+        sem.acquire(2)
+        throwable.foreach { t => throw improveStackTrace(t) }
+        checkAnswer(spark.sql("select a, b from t where b = 1"), df1)
+        checkAnswer(spark.sql("select a, b from t where b = 2"), df2)
+      }
+    }
 
 Review comment:
   Remove this UT, because the DataFrame API cannot specify a partition key value,
   so with our current approach these two writes cannot run concurrently.
   
   https://github.com/apache/spark/blob/a1b90bfc0faa2b4b2b7388443a734a217792d585/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L412-L420
   
   Maybe we can support specifying a partition key for DataFrame writes in the future.
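
   To illustrate the limitation, here is a minimal sketch contrasting the two
   write paths. It assumes a local SparkSession on Spark 2.3 or later with
   spark-sql on the classpath; the object name and the `src` view are
   hypothetical and not part of this PR. The SQL statement names its target
   partition statically, while `insertInto` offers no way to pass a partition
   spec, so every partition column is resolved dynamically from the data.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object, not part of the PR under review.
object StaticVsDynamicPartitionInsert {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("static-vs-dynamic-partition-insert")
      // Same setting the tests enable via SQLConf.PARTITION_OVERWRITE_MODE.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()
    import spark.implicits._

    // Same setup as the test: a table partitioned by column b.
    Seq((1, 2)).toDF("a", "b")
      .write
      .partitionBy("b")
      .mode("overwrite")
      .saveAsTable("t")

    // SQL path: the target partition (b = 1) is stated up front,
    // so it is known before the job starts.
    spark.range(0, 10).selectExpr("CAST(id AS INT) AS a").createOrReplaceTempView("src")
    spark.sql("INSERT OVERWRITE TABLE t PARTITION (b = 1) SELECT a FROM src")

    // DataFrame path: insertInto takes only a table name, so the
    // partition value (b = 2) is discovered from the rows being written.
    Seq((10, 2)).toDF("a", "b")
      .write
      .mode("overwrite")
      .insertInto("t")

    spark.table("t").orderBy("b", "a").show()
    spark.stop()
  }
}
```

   Because the DataFrame path carries no static partition value, the
   partitions a job will touch are not known in advance, which is why the
   concurrent test above cannot pass with the current approach.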

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

