Github user markhamstra commented on the pull request:
https://github.com/apache/spark/pull/639#issuecomment-42154867
Okay, first, I didn't start work on this JIRA myself because I haven't had
the time to come to a complete understanding of the current and intended
functionality, nor of the scope of the problem from running completeRecovery()
on an Akka scheduler thread, so you shouldn't assume that the JIRA fully
describes the problem or that solving just what it describes constitutes a
complete fix. For example, any exception currently thrown by
completeRecovery() when run by the scheduler will not be caught because of
SPARK-1620, so a complete fix to the use of the scheduler from `case
ElectedLeader` will also have to incorporate whatever we end up doing to fix
that issue.
Second, what you've done in this PR significantly changes the results of
calling completeRecovery(), and there is at least a minor problem in doing so.
Previously, schedule() would be called synchronously after setting
RecoveryState.ALIVE and before logging the completion of recovery. Now you're
causing recovery() to be called asynchronously when the TriggerSchedule event
makes its way to the head of the message queue. That will potentially be long
after RecoveryState.ALIVE has been set, the completion of recovery has been
logged, and other events are processed with the new RecoveryState even though
schedule() has not yet been run. The out-of-order logging is at least a minor
problem. The processing of other events before schedule() has been run is
potentially a major problem -- but like I said, I haven't yet worked through
exactly how things are supposed to work, so I'm also not certain how big of a
problem this PR would create.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---