Github user zheh12 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    I think I may not have described this issue clearly.
    
    First of all, the scenario is this: multiple applications append data to the same parquet datasource table at the same time.
    
    They run concurrently, so they all share the same output directory, which is set via:
    
    ```
    FileOutputFormat.setOutputPath(job, new Path(outputSpec.outputPath))
    ```
    
    `outputSpec.outputPath` is the output table directory `skip_dir/tab1/`.
    
    `skip_dir/tab1/_temporary` will be created as the temporary dir.
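    For context, here is a rough sketch of how the scenario can arise (the table name, data, and setup are placeholders, not the exact reproduction): run the same append job as two or more separate spark-submit applications that overlap in time.
    
    ```
    import org.apache.spark.sql.{SaveMode, SparkSession}
    
    // Submitted as two (or more) independent applications at roughly the same time.
    object ConcurrentAppend {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("concurrent-append")
          .enableHiveSupport()
          .getOrCreate()
        import spark.implicits._
    
        // "tab1" is a pre-existing parquet datasource table whose location is
        // the shared directory (skip_dir/tab1/ above).
        val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
        df.write.mode(SaveMode.Append).insertInto("tab1")
    
        spark.stop()
      }
    }
    ```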
    
    But once one job is successfully committed, it runs `cleanupJob`:
    
    ```
    Path pendingJobAttemptsPath = getPendingJobAttemptsPath();
    
    fs.delete(pendingJobAttemptsPath, true);
    ```
    
    The `pendingJobAttemptsPath` is `skip_dir/tab1/_temporary`:
    
    ```
    private Path getPendingJobAttemptsPath() {
        return getPendingJobAttemptsPath(getOutputPath());
    }
    
    private static Path getPendingJobAttemptsPath(Path out) {
        return new Path(out, PENDING_DIR_NAME);
    }
    
    public static final String PENDING_DIR_NAME = "_temporary";
    ```
    
    After that job is committed, `skip_dir/tab1/_temporary` is deleted. Then, when the other jobs attempt to commit, they fail with an error.
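    To make the race concrete, here is an illustrative sketch using plain `FileSystem` calls (not the actual committer code; the paths are the ones from the example above):
    
    ```
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    
    val fs = FileSystem.get(new Configuration())
    val pending = new Path("skip_dir/tab1/_temporary")
    
    // Job A finishes first: commitJob -> cleanupJob -> recursive delete of _temporary.
    fs.delete(pending, true)
    
    // Job B is still running and later looks for its pending output under
    // skip_dir/tab1/_temporary/0/...; the directory is gone, so it fails.
    fs.listStatus(new Path(pending, "0"))   // FileNotFoundException
    ```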
    
    Meanwhile, because all applications share the same app attempt id, they write temporary data to the same temporary dir `skip_dir/tab1/_temporary/0`, so the data committed by the successful application is also corrupted.
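    The reason they all collide on `_temporary/0` is the attempt id: roughly speaking (paraphrasing how `FileOutputCommitter` builds the job attempt path; details may vary across Hadoop versions), the path is the output directory plus the pending dir name plus the application attempt id, and the attempt id defaults to 0 for every application:
    
    ```
    import org.apache.hadoop.fs.Path
    
    val outputPath     = new Path("skip_dir/tab1")           // shared table location
    val pendingDir     = new Path(outputPath, "_temporary")  // PENDING_DIR_NAME
    val appAttemptId   = 0                                    // default attempt id
    val jobAttemptPath = new Path(pendingDir, appAttemptId.toString)
    // => skip_dir/tab1/_temporary/0 for every concurrent application
    ```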

