Github user mccheah commented on the pull request:

    https://github.com/apache/spark/pull/4155#issuecomment-71253931
  
    I'm also concerned about the performance ramifications of this. We need to 
run performance benchmarks. However, the only critical path affected by this 
is tasks that are explicitly saving to a Hadoop file. When a task completes, 
the DAGScheduler sends a message to the OutputCommitCoordinator actor, so the 
DAGScheduler is not blocked by this logic.
    
    We do actually need the processing to be single-threaded, as trying to 
coordinate synchronization on the centralized arbitration logic is a bit of a 
nightmare. I mean, we could allow multiple threads to access the internal state 
of OutputCommitCoordinator and implement appropriate synchronization logic. I 
considered an optimization where the driver broadcasts to executors when tasks 
are being speculated, so the executors running the original tasks would know to 
check for commit authorization and skip the check for tasks that don't have 
speculated copies. There are a lot of race conditions that arise from that, 
though, which further underlines the need to centralize everything.
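    To make the single-threaded arbitration point concrete, here is a minimal 
sketch (class and method names are illustrative, not Spark's actual API) of a 
coordinator that authorizes exactly one task attempt per partition to commit. 
Because all requests are funneled through one thread (e.g. an actor's receive 
loop), no locking on the internal map is needed:

    ```scala
    // Hypothetical sketch of centralized commit arbitration.
    // All calls are assumed to happen on a single thread, so the
    // mutable map needs no synchronization.
    class CommitArbiter {
      // partition -> attempt number that was granted the right to commit
      private val authorized = scala.collection.mutable.Map[Int, Int]()

      def canCommit(partition: Int, attempt: Int): Boolean =
        authorized.get(partition) match {
          // Only the attempt that won the race may commit (idempotent re-ask).
          case Some(winner) => winner == attempt
          // First attempt to ask for this partition wins authorization.
          case None =>
            authorized(partition) = attempt
            true
        }
    }
    ```

    A speculated copy asking after the original would simply get `false` back 
and discard its output, instead of the two attempts racing on the Hadoop 
committer directly.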

