Github user StephanEwen commented on the pull request:
https://github.com/apache/flink/pull/459#issuecomment-77780578
Nice first prototype. From my first tries, it seems to work with respect to
checkpointing Kafka state in a pipelined (non-blocking) fashion and to reuse it.
Here are a few things I think should be addressed, to make sure that we get
high code quality for this sensible piece of the system.
- This seems to have reproducibly broken the
`TaskManagerFailureRecoveryITCase.testRestartWithFailingTaskManager`. Probably
through the additional "restart()" call in the execution graph. I am confused
why it does not break more.
- This Pull Request "sneaks" in a new job type flag, which marks jobs as
"batch" or "streaming". I think this is definitely much needed, and we should
make this more prominent.
- Some commits introduce a completely new code style. I have recently
written a mail to the mailing list with a discussion about more homogeneity in
the project. This is a good example, where there is no good reason not to stick
with the common style from the rest of the code. Makes the code much more
readable if there is a common style behind it.
- Related to the previous comment, there are quire a few unnecessary
reformattings, which actually take code that was in the original common style,
and change it to a different uncommon style.
- There are many "printStackTrace" command and logging in general is
neglected a bit. For a complex mechanism like this, it is super crucial as soon
as you need to debug a deployed case.
- Comments are missing big time. This code
- Messages received by the JobManager/TaskManager which are dedicated for
another actor can be "forwarded". That keeps the original sender, which helps
in routing answers.
- I would suggest to rename the "StreamStateMonitor" to something along
the lines of "StreamCheckpointCoordinator".
- The TaskManager and JobManager have quite a few different messages to
handle, would be nice to do more communication endpoint to endpoint (state
tracker to source). Not urgent, may be a future refactoring.
- Along the lines of getting/maintaining a more coherent code base, I
would vote to make new classes in the flink-runtime project in Java. Actors via
"UntypedActor" are not too bad in Akka/Java.
- I think the best way of desiging a resilinet system is to not make a
distinction between the first execution, a recovery execution, a rebalance
execution, ... They all should function the same way. There should be no
dedicated code paths and flags "onRecovery". The only difference here is that
initially, the state is empty, which can be modelled as state as well.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---