Github user jinxing64 commented on the issue:
https://github.com/apache/spark/pull/21286
Does Spark have a jobID in the writing path? Below is an example path from my
debugging log:
```
parquettest2/_temporary/0/_temporary/attempt_20180515215310_0000_m_000000_0/part-00000-9104445e-e54a-4e3f-9ba4-e624d60e6247-c000.snappy.parquet
```
parquettest2 is a non-partitioned table. It seems that the `jobAttemptId` in
`_temporary/$jobAttemptId/_temporary/$taskID_$taskAttemptID` is always 0 when
there is no retry.
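For context, here is a simplified sketch of how that `_temporary/$jobAttemptId` segment gets derived, based on my reading of Hadoop's `FileOutputCommitter` (the method shape here is an approximation, not the exact Hadoop code):
```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, MRJobConfig}

// Simplified sketch of FileOutputCommitter's job attempt path logic.
def jobAttemptPath(out: Path, context: JobContext): Path = {
  // The application attempt id defaults to 0, which is why the example
  // path above always shows `_temporary/0` when there is no retry.
  val appAttemptId =
    context.getConfiguration.getInt(MRJobConfig.APPLICATION_ATTEMPT_ID, 0)
  new Path(new Path(out, "_temporary"), String.valueOf(appAttemptId))
}
```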
If no unique jobID is included in the writing path, consider the scenario
below (a hypothetical reproduction follows it):
```
1. JobA starts and writes data to
dir/tab/_temporary/0/_temporary/$taskID_$taskAttemptID
2. JobB starts and writes data to
dir/tab/_temporary/0/_temporary/$taskID_$taskAttemptID
3. Note that JobA and JobB write data to dir/tab/_temporary/0/_temporary at
the same time
4. When JobA commits, all data under dir/tab/_temporary/0/ is committed as
the output -- a mixture of output from both JobA and JobB, so the data
written to the target table is incorrect.
5. When JobA commits and cleans up, dir/tab/_temporary/ is deleted. At
this moment JobB is not finished yet; it cannot find dir/tab/_temporary/0/
and fails.
```
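A rough way to reproduce this, assuming a spark-shell session where `spark` is the `SparkSession` (the path and row counts are made up for illustration):
```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Two jobs racing on the same output directory.
val out = "dir/tab"
val jobA = Future { spark.range(100).write.mode("append").parquet(out) }
val jobB = Future { spark.range(100).write.mode("append").parquet(out) }
// Both jobs stage task output under dir/tab/_temporary/0/_temporary/,
// so whichever commits first can pick up the other job's files, or
// delete dir/tab/_temporary while the other job is still running.
Await.result(Future.sequence(Seq(jobA, jobB)), 10.minutes)
```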
If I understand correctly, this PR proposes to add a jobID outside
`_temporary`, so the writing path format becomes:
`$jobID/_temporary/$jobAttemptId/_temporary/$taskID_$taskAttemptID`.
Thus the change sits outside the committer and doesn't break the committer's
logic. Did I understand correctly?
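In other words, something like the sketch below; this is only my reading of the proposed layout, not the PR's actual code, and the names are illustrative:
```scala
import java.util.UUID
import org.apache.hadoop.fs.Path

// A unique jobID above _temporary keeps concurrent jobs' staging
// directories disjoint, so each job commits and cleans up only its own.
def taskAttemptPath(out: Path, jobID: String, jobAttemptId: Int,
                    taskId: String, taskAttemptId: String): Path =
  new Path(out,
    s"$jobID/_temporary/$jobAttemptId/_temporary/${taskId}_$taskAttemptId")

// e.g. jobID = UUID.randomUUID().toString gives each job its own root.
```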