steveloughran commented on issue #25863: [SPARK-28945][SPARK-29037][CORE][SQL] Fix the issue that spark gives duplicate result and support concurrent file source write operations write to different partitions in the same table. URL: https://github.com/apache/spark/pull/25863#issuecomment-536518159 This scares me. The `FileOutputCommitter` code scares me already. It's why when I added a new committer extension point I chose not to go near that code and instead add a new superclass. It's scary code because * it's co-recursive in a way that hard to follow. * it's used in the MRv1 and V2 APIs * it implements to different commit algorithms in the same methods * there is no documentation other than that code. * one of those commit algorithms, doesn't have the failure semantics spark expects (network partition during task commit) * it's been widely subclassed in ways people aren't aware of. * there's lots of references to static methods which you can't subclass when writing your own committer. `mergePaths()` was one of the most obtuse bits of code I've ever had to step through with a debugger and a pen and paper. And I've stepped through how Windows NT boots. You are proposing using reflection to get at private superclass methods that are not part of the public API, where we aren't even going to be aware of the fact that they are being used this way and have no guarantees of stability whatsoever. And of course, for anything which implement the new superclass, `PathOutputCommitProtocol` it's not going to work. Overall then doing things during commit operations for which it is going to be very hard to prove correctness -and vulnerable two people doing things in the MR codebase we could fundamentally break things, either a link time, or worse, in the semantics of the operations you are trying to perform. I don't see an easy solution here, but as someone with write access to the MR codebase, I'm happy to discuss how we could make things easier. For example, someone could actually pull out the `mergePath` code into its own class, with its own tests which could then be reused. We can also look about whether we could extend `PathOutputCommitter` to offer more here. However mergePath contains some big assumptions about the nature of the destination and what the commit algorithms do. In ideal world, how we commit should be opaque to the caller. You commit a task and its work is promoted into the set of committed tasks work, or you don't get a response from the task's process, in which case you can safely commit another task, confident that irrespective of what has gone wrong with the task which failed/partitioned/hung only one tasks output will ever become visible -and after the job has been committed or aborted the output of a partitioned task will *never* appear. V2 doesn't offer those guarantees, which is why I consider it broken. One thing I have thought about adding to `PathOutputCommitter` is a predicate `isTaskCommitRepeatable()` which would at least tell the spark driver, MR AM etc whether it is safe to attempt to retry a task if it failed what timed out during that commit. the V2 protocol would say false, as should the new EMR spark committer. Not sure if that would help here.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
