advancedxy commented on issue #25795: [WIP][SPARK-29037][Core] Spark gives 
duplicate result when an application was killed
URL: https://github.com/apache/spark/pull/25795#issuecomment-532570982
 
 
   > @advancedxy can you give a completed proposal for it?
   
   All right, I think the requirements can be split into two parts:
   
   1. Support concurrent writes to different locations (partitions).
       This is achieved by giving each write a different output path:
       * For `dynamicPartitionOverwrite`, the output could be the staging 
dir (the current solution of #25739), which is unique to each write. 
       * For `dynamicPartitionOverwrite=false` and a partitioned table, the 
output in the `OutputCommitter` could be 
`$table_output/static_part_key1=value1/static_part_key2=value2/...`. Concurrent 
writes to partitions prefixed by different static partitions won't interfere 
with each other. This could be extended in #25379. 
       * For a non-partitioned table, there is only one output location, so 
concurrent writes are not supported.
   2. Detect concurrent writes to the same location and fail fast.
       This can be achieved during the `setupJob` stage. We can check for the 
existence of the output path, as `FileOutputFormat` does. If the output path 
already exists, it must have been created by another concurrent write job or 
left behind by a previous failed/killed job. We can throw an exception listing 
the possible reasons and fail the current job. Of course, we cannot simply 
check the output path passed to JobConf, as `$table_output` should already be 
present (unless the table is being created for the first time). 
`$table_output/_temporary/$app_attempt_num` could be a good candidate.
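
The two requirements above can be sketched together as follows. This is only 
an illustrative Python sketch, not Spark's committer code: the local filesystem 
stands in for HDFS, and `resolve_job_output`, `setup_job`, 
`ConcurrentWriteError` and the `.spark-staging-<uuid>` directory name are 
hypothetical names chosen for the example.

```python
import os
import uuid


class ConcurrentWriteError(Exception):
    """Raised when another job appears to own the same output location."""


def resolve_job_output(table_output, static_partitions, dynamic_overwrite):
    """Pick the directory one write job treats as its output root.

    static_partitions is an ordered list of (key, value) pairs fixed by the
    query; it is empty for a non-partitioned table.
    """
    if dynamic_overwrite:
        # Hypothetical staging layout: one unique directory per write.
        return os.path.join(table_output, ".spark-staging-" + uuid.uuid4().hex)
    # Static partitions narrow the output to a partition prefix. With no
    # static partitions the job output is the table root itself.
    parts = ["%s=%s" % (key, value) for key, value in static_partitions]
    return os.path.join(table_output, *parts)


def setup_job(job_output, app_attempt_num):
    """Claim <job_output>/_temporary/<app_attempt_num> or fail fast."""
    marker = os.path.join(job_output, "_temporary", str(app_attempt_num))
    try:
        # Parent directories may already exist; only an existing leaf
        # directory makes this call raise FileExistsError.
        os.makedirs(marker, exist_ok=False)
    except FileExistsError:
        raise ConcurrentWriteError(
            "%s already exists: another job may be writing to the same "
            "location, or a previous failed/killed job left it behind" % marker
        )
    return marker
```

With this layout, two jobs writing under different static partition prefixes 
claim different `_temporary` directories and both pass `setup_job`, while a 
second job targeting the same prefix raises `ConcurrentWriteError`. A real 
implementation on HDFS would also have to consider the gap between checking 
for the directory and creating it.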
   
      One more thing to do in Spark: it should infer the YARN app attempt 
number when running in YARN mode. Currently, the app attempt number is always 0 
when writing.
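
One possible way to infer the attempt number is sketched below, assuming the 
process runs inside a YARN container whose `CONTAINER_ID` environment variable 
follows the usual 
`container_[e<epoch>_]<clusterTimestamp>_<appId>_<attemptId>_<containerId>` 
layout. The helper name `app_attempt_number` is hypothetical, and this is not 
how Spark currently obtains the value; outside a container (for example a 
client-mode driver) it falls back to the default of 0.

```python
import os
import re

# container_<clusterTimestamp>_<appId>_<attemptId>_<containerId>, with an
# optional e<epoch>_ segment right after "container_".
_CONTAINER_ID = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_(\d+)_(\d+)$")


def app_attempt_number(env=None, default=0):
    """Return the YARN app attempt number parsed from CONTAINER_ID.

    Falls back to `default` when the variable is missing or malformed.
    """
    env = os.environ if env is None else env
    match = _CONTAINER_ID.match(env.get("CONTAINER_ID", ""))
    if match is None:
        return default
    # Group 3 is the attempt id field of the container id.
    return int(match.group(3))
```

The result could then be used as the `$app_attempt_num` component of the 
`_temporary` path, so that a retried application attempt does not collide with 
the directory left by its predecessor.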
   
   I believe the proposed approach should cover both concurrent writes and the 
case in this PR. WDYT @cloud-fan, @turboFei and @wangyum?
