Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71253931
I'm also concerned about the performance ramifications of this. We need to
run performance benchmarks. However, the only critical path affected by this
is for tasks that are explicitly saving to Hadoop files. When a task
completes, the DAGScheduler sends a message to the OutputCommitCoordinator
actor, so the DAGScheduler itself is not blocked by this logic.
We do actually need the processing to be single-threaded, as trying to
coordinate synchronization on the centralized arbitration logic is a bit of a
nightmare. We could, in principle, allow multiple threads to access the
internal state of OutputCommitCoordinator and implement the appropriate
synchronization logic. I also considered an optimization where the driver
broadcasts to executors when tasks are being speculated; the executors running
the original tasks would then know to check for commit authorization, and
tasks without speculated copies could skip the check. There are a lot of race
conditions that arise from that, though, which further underlines the need to
centralize everything.
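To make the single-threaded arbitration concrete, here is a minimal sketch (not Spark's actual implementation; the class and method names here are hypothetical). All commit-authorization requests are funneled through one single-threaded executor, standing in for the actor's mailbox, so the mutable state needs no locks: the first attempt to ask for a given (stage, partition) wins, and any speculated copy asking later is denied.

```scala
import java.util.concurrent.Executors
import scala.collection.mutable
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._

// Hypothetical sketch of centralized commit arbitration. A real actor-based
// version would receive AskPermissionToCommit messages; a single-threaded
// executor gives the same one-message-at-a-time processing guarantee.
class CommitArbiter {
  // (stageId, partitionId) -> attemptId that won the right to commit.
  // Only ever touched from the single executor thread, so no locking needed.
  private val authorized = mutable.Map.empty[(Int, Int), Long]
  private val executor = Executors.newSingleThreadExecutor()

  def canCommit(stageId: Int, partitionId: Int, attemptId: Long): Boolean = {
    val result = Promise[Boolean]()
    executor.execute(new Runnable {
      def run(): Unit = {
        val key = (stageId, partitionId)
        authorized.get(key) match {
          // Same attempt asking again is still authorized (idempotent);
          // a different (speculated) attempt is denied.
          case Some(winner) => result.success(winner == attemptId)
          case None =>
            authorized(key) = attemptId // first asker wins
            result.success(true)
        }
      }
    })
    Await.result(result.future, 10.seconds)
  }

  def shutdown(): Unit = executor.shutdown()
}
```

Because every request is serialized through one thread, the race between an original task and its speculated copy reduces to message ordering: whichever request is processed first gets the commit, and the loser simply skips its commit.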