Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/459#issuecomment-77780578
  
    Nice first prototype. From my first tries, it seems to work with respect to 
checkpointing Kafka state in a pipelined (non-blocking) fashion and to reuse it.
    
    Here are a few things I think should be addressed, to make sure that we get 
high code quality for this sensible piece of the system.
    
      - This seems to have reproducibly broken the 
`TaskManagerFailureRecoveryITCase.testRestartWithFailingTaskManager`. Probably 
through the additional "restart()" call in the execution graph. I am confused 
why it does not break more.
      
      - This Pull Request "sneaks" in a new job type flag, which marks jobs as 
"batch" or "streaming". I think this is definitely much needed, and we should 
make this more prominent.
        
      - Some commits introduce a completely new code style. I have recently 
written a mail to the mailing list with a discussion about more homogeneity in 
the project. This is a good example, where there is no good reason not to stick 
with the common style from the rest of the code. Makes the code much more 
readable if there is a common style behind it.
    
      - Related to the previous comment, there are quire a few unnecessary 
reformattings, which actually take code that was in the original common style, 
and change it to a different uncommon style.
    
      - There are many "printStackTrace" command and logging in general is 
neglected a bit. For a complex mechanism like this, it is super crucial as soon 
as you need to debug a deployed case.
      
      - Comments are missing big time. This code 
      
      - Messages received by the JobManager/TaskManager which are dedicated for 
another actor can be "forwarded". That keeps the original sender, which helps 
in routing answers.
      
      - I would suggest to rename the "StreamStateMonitor" to something along 
the lines of "StreamCheckpointCoordinator".
      
      - The TaskManager and JobManager have quite a few different messages to 
handle, would be nice to do more communication endpoint to endpoint (state 
tracker to source). Not urgent, may be a future refactoring.
    
      - Along the lines of getting/maintaining a more coherent code base, I 
would vote to make new classes in the flink-runtime project in Java. Actors via 
"UntypedActor" are not too bad in Akka/Java.
      
      - I think the best way of desiging a resilinet system is to not make a 
distinction between the first execution, a recovery execution, a rebalance 
execution, ... They all should function the same way. There should be no 
dedicated code paths and flags "onRecovery". The only difference here is that 
initially, the state is empty, which can be modelled as state as well.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to