[GitHub] [spark] turboFei edited a comment on issue #25795: [WIP][SPARK-29037][Core] Spark gives duplicate result when an application was killed

GitBox Thu, 19 Sep 2019 20:12:46 -0700

turboFei edited a comment on issue #25795: [WIP][SPARK-29037][Core] Spark gives 
duplicate result when an application was killed
URL: https://github.com/apache/spark/pull/25795#issuecomment-533387014
 
 
   @cloud-fan 
   For the dynamicPartitionOverwrite=true case, here is my proposal:
   
   
   
   We can unify the staging dir for both dynamic and static insert overwrite, 
based on the specified static partition key-value pairs.
   
   ```scala
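     private def stagingDir: Path = {
       // One "sp_<key>=<value>" directory level per static partition.
       // Hadoop's Path.SEPARATOR ("/") is used instead of File.separator.
       val staticPart = staticPartitionKVS
         .map(kv => "sp_" + kv._1 + "=" + kv._2)
         .mkString(Path.SEPARATOR)
       new Path(path, ".spark-staging-" + jobId + Path.SEPARATOR + staticPart)
     }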
   ```
   
   For example, given the table:
   
   `ta (c1 int, p1 int, p2 int, p3 int) partitioned by (p1, p2, p3)`
   
   ```sql
   insert overwrite table ta partition(p1=1,p2,p3) select ...
   -- stagingDir: .spark-staging-${UUID}/sp_p1=1
   
   insert overwrite table ta partition(p1=1,p2=2,p3) select ...
   -- stagingDir: .spark-staging-${UUID}/sp_p1=1/sp_p2=2
   
   insert overwrite table ta select ...
   -- stagingDir: .spark-staging-${UUID}
   ```
   
   
   
   The stagingDir is cleaned up after the job finishes.
   
   Before each insert, we should check the paths whose names start with 
`.spark-staging` and, under each of them, find the longest path made of 
`sp_`-prefixed components.
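
   A minimal sketch of that scan, working on plain strings rather than a real 
file system. It assumes the directories under the table root have already been 
listed as relative paths; `StagingScanSketch` and `longestStaticSpecs` are 
hypothetical names, not part of this PR.

   ```scala
   // Hypothetical helper, for illustration only.
   object StagingScanSketch {
     // For every staging root (a first component starting with ".spark-staging"),
     // return the deepest leading chain of "sp_" components seen under it.
     def longestStaticSpecs(relativeDirs: Seq[String]): Map[String, Seq[String]] = {
       relativeDirs
         .map(_.split("/").toSeq.filter(_.nonEmpty))
         .collect { case root +: rest if root.startsWith(".spark-staging") =>
           // Stop at the first component that is not a static partition level.
           root -> rest.takeWhile(_.startsWith("sp_"))
         }
         .groupBy(_._1)
         .map { case (root, specs) => root -> specs.map(_._2).maxBy(_.length) }
     }
   }
   ```

   A staging root with no `sp_` children maps to an empty sequence, which 
corresponds to an insert without any static partition.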
   
   
   
   Consider two such paths A and B.
   
   If A is fully contained by B, or B is fully contained by A, the two jobs 
cannot write concurrently.
   
   For example:
   
   `A: 'sp_p1=1/sp_p2=2'  B: 'sp_p1=1'`: A is fully contained by B.
   
   `A: ''  B: 'sp_p1=1'`: A is the empty path, so it contains all other paths.
   
   `A: 'sp_p1=1/sp_p2=2'  B: 'sp_p1=1/sp_p2=3'`: A and B do not contain each 
other, so they can write concurrently.
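
   The containment rule can be sketched as a prefix check on path components. 
This is only an illustration under the assumption that each path is given as a 
string such as `sp_p1=1/sp_p2=2`; `StagingConflictSketch` and `conflicts` are 
hypothetical names, not part of this PR.

   ```scala
   // Hypothetical helper, for illustration only.
   object StagingConflictSketch {
     // Split "sp_p1=1/sp_p2=2" into components; "" yields no components.
     private def components(spec: String): Seq[String] =
       spec.split("/").toSeq.filter(_.nonEmpty)

     // Two writers conflict when one path is a component-wise prefix of the other.
     def conflicts(a: String, b: String): Boolean = {
       val ca = components(a)
       val cb = components(b)
       ca.startsWith(cb) || cb.startsWith(ca)
     }
   }
   ```

   Comparing whole components rather than raw strings keeps `sp_p1=1` and 
`sp_p1=10` from being treated as containing each other.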
   
   
   
   Whether to use a UUID in the staging dir depends on the implementation of 
https://github.com/apache/spark/pull/25739 .
   
   
   
   
   
   
   
   In addition, this PR was originally dedicated to resolving the duplicate 
result issue.
   
   I have created a new PR https://github.com/apache/spark/pull/25863 for the 
duplicate result issue. I hope it makes sense. Thanks.


