Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3340
  
    There is a lot of good code in this PR.
    
    What I would suggest to make different is to NOT make `Execution`, 
`IntermediateResult` partition, etc mutable. There is a big benefit to having 
them immutable, both for safety against accidental bugs, and also from a 
concurrency standpoint. Regular mutable non-volatile variables have visibility 
problems in concurrent access, which leads to very subtle bugs that are hard to 
trace and recover: 
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/package-summary.html#MemoryVisibility
    
    For `Execution`, I would change it such that the `ExecutionVertex` swaps 
the current `Execution` (in state `RECONCILING`) with the a new `Execution` 
that has all the fields properly set. That way there is a quasi atomic 
replacement of the previous execution, which is safer and easier to reason 
about.
    
    It may be worth to even create the JobManager (from the JobManagerRunner) 
and the ExecutionGraph directly in reconciliation mode. That way, you don't 
have to worry what to do with TaskManagers that register while the JobMaster 
has not yet switched to reconciliation.
    
    Can you explain a bit why the reconciliation need the two phases:
      1. Receive reports, set the execution attempt and state
      2. Once all are there, transition the state to running
    
    Could it be done such that on report the `Execution` is directly set to the 
reported state? Then once all reports are there, everything is just fine. After 
the reconciliation timeout expires, you would just iterate over all executions 
and `fail()` the ones that are still in `RECONCILE` - recovery logic takes care 
of the rest.
    That may also handle the case where some already reconciled tasks fail 
while reconciliation still happens. The ExecutionGraph would already recover 
the local tasks that are affected by the failure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to