[GitHub] [spark] steveloughran commented on issue #25863: [SPARK-28945][SPARK-29037][CORE][SQL] Fix the issue that spark gives duplicate result and support concurrent file source write operations write to different partitions in the same table.

GitBox Mon, 30 Sep 2019 04:21:45 -0700

steveloughran commented on issue #25863: [SPARK-28945][SPARK-29037][CORE][SQL] 
Fix the issue that spark gives duplicate result and support concurrent file 
source write operations write to different partitions in the same table.
URL: https://github.com/apache/spark/pull/25863#issuecomment-536518159
 
 
   This scares me. The `FileOutputCommitter` code scares me already. It's why 
when I added a new committer extension point I chose not to go near that code 
and instead add a new superclass. It's scary code because
   
   * it's co-recursive in a way that hard to follow.
   * it's used in the MRv1 and V2 APIs
   * it implements to different commit algorithms in the same methods
   * there is no documentation other than that code.
   * one of those commit algorithms, doesn't have the failure semantics spark 
expects (network partition during task commit)
   * it's been widely subclassed in ways people aren't aware of.
   * there's lots of references to static methods which you can't subclass when 
writing your own committer.
   
   `mergePaths()` was one of the most obtuse bits of code I've ever had to step 
through with a debugger and a pen and paper. And I've stepped through how 
Windows NT boots.
   
   You are proposing using reflection to get at private superclass methods that 
are not part of the public API, where we aren't even going to be aware of the 
fact that they are being used this way and have no guarantees of stability 
whatsoever.
   
   
   And of course, for anything which implement the new superclass, 
`PathOutputCommitProtocol` it's not going to work.
   
   
   Overall then doing things during commit operations for which it is going to 
be very hard to prove correctness -and vulnerable two people doing things in 
the MR codebase we could fundamentally break things, either a link time, or 
worse, in the semantics of the operations you are trying to perform.
   
   I don't see an easy solution here, but as someone with write access to the 
MR codebase, I'm happy to discuss how we could make things easier.
   
   For example, someone could actually pull out the `mergePath` code into its 
own class, with its own tests which could then be reused. We can also look 
about whether we could extend `PathOutputCommitter` to offer more here. However 
mergePath contains some big assumptions about the nature of the destination and 
what the commit algorithms do.
   
   In ideal world, how we commit should be opaque to the caller. You commit a 
task and its work is promoted into the set of committed tasks work, or you 
don't get a response from the task's process, in which case you can safely 
commit another task, confident that irrespective of what has gone wrong with 
the task which failed/partitioned/hung only one tasks output will ever become 
visible -and after the job has been committed or aborted the output of a 
partitioned task will *never* appear. V2 doesn't offer those guarantees, which 
is why I consider it broken. 
   
   One thing I have thought about adding to `PathOutputCommitter` is a 
predicate `isTaskCommitRepeatable()` which would at least tell the spark 
driver, MR AM etc whether it is safe to attempt to retry a task if it failed 
what timed out during that commit. the V2 protocol would say false, as should 
the new EMR spark committer. Not sure if that would help here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] steveloughran commented on issue #25863: [SPARK-28945][SPARK-29037][CORE][SQL] Fix the issue that spark gives duplicate result and support concurrent file source write operations write to different partitions in the same table.

Reply via email to