GitHub user jinxing64 commented on the issue:

    https://github.com/apache/spark/pull/21286
  
    Does Spark have a job ID in the write path? The path below is an example from my debugging log:
    ```
    parquettest2/_temporary/0/_temporary/attempt_20180515215310_0000_m_000000_0/part-00000-9104445e-e54a-4e3f-9ba4-e624d60e6247-c000.snappy.parquet
    ```
    parquettest2 is a non-partitioned table. It seems that the `jobAttemptId` in `_temporary/$jobAttemptId/_temporary/$taskID_$taskAttemptID` is always 0 when there is no retry.
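    For reference, this layout comes from Hadoop's `FileOutputCommitter`. A minimal sketch that reproduces the path above (the task-attempt values are copied from my log; the `jobAttemptId` segment defaults to 0 because `mapreduce.job.application.attempt.id` is unset):
    ```
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.{TaskAttemptID, TaskType}
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
    import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

    val conf = new Configuration()
    // attempt_20180515215310_0000_m_000000_0, as in the log above
    val attempt = new TaskAttemptID("20180515215310", 0, TaskType.MAP, 0, 0)
    val ctx = new TaskAttemptContextImpl(conf, attempt)
    // prints parquettest2/_temporary/0/_temporary/attempt_20180515215310_0000_m_000000_0
    println(FileOutputCommitter.getTaskAttemptPath(ctx, new Path("parquettest2")))
    ```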
    If no unique job ID is included in the write path, consider the following scenario (a minimal repro sketch follows the steps):
    ```
    1. JobA starts and writes data to dir/tab/_temporary/0/_temporary/$taskID_$taskAttemptID
    2. JobB starts and writes data to dir/tab/_temporary/0/_temporary/$taskID_$taskAttemptID
    3. Note that JobA and JobB write data to dir/tab/_temporary/0/_temporary at the same time.
    4. When JobA commits, all data under dir/tab/_temporary/0/ is committed as the output -- yes, it's a mixture from both JobA and JobB, so the data written to the target table is incorrect.
    5. When JobA commits and cleans up, dir/tab/_temporary/ is deleted. At this moment, JobB is not finished yet; it cannot find dir/tab/_temporary/0/ and fails.
    ```
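    To make step 3 concrete, here is a minimal repro sketch (assuming an existing `SparkSession` named `spark` and the default `FileOutputCommitter`); both writes stage under the same `dir/tab/_temporary/0`:
    ```
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global

    // assumes an existing SparkSession named `spark`.
    // JobA and JobB append to the same table directory concurrently; both
    // stage task files under dir/tab/_temporary/0/_temporary, so whichever
    // job commits first picks up (or deletes) the other's in-flight output.
    val out = "dir/tab"
    val jobA = Future { spark.range(0, 100).write.mode("append").parquet(out) }
    val jobB = Future { spark.range(0, 100).write.mode("append").parquet(out) }
    Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf)
    ```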
    If I understand correctly, this PR proposes to add a job ID outside `_temporary`, so the write path format looks like:
    `$jobID/_temporary/$jobAttemptId/_temporary/$taskID_$taskAttemptID`.
    Thus the change stays outside the committer and doesn't break the committer's logic.
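    As an illustration of that layout (a hypothetical helper, not the PR's actual code), the unique prefix could be derived once per job, and everything the committer manages stays inside it:
    ```
    import java.util.UUID
    import org.apache.hadoop.fs.Path

    // Hypothetical sketch: each write job gets its own staging root outside
    // the committer-managed _temporary tree, so concurrent jobs never share
    // a $jobAttemptId directory and the committer's logic is untouched.
    def jobStagingDir(table: Path): Path =
      new Path(table, s".spark-staging-${UUID.randomUUID()}")

    // the committer then writes under:
    //   $stagingRoot/_temporary/$jobAttemptId/_temporary/$taskID_$taskAttemptID
    val stagingRoot = jobStagingDir(new Path("dir/tab"))
    ```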
    Did I understand correctly?

