Github user tillrohrmann commented on the pull request:
https://github.com/apache/flink/pull/750#issuecomment-161949525
When fixing the `JobManagerTest` I noticed the following: if a job is stopped
while it is still in the state `SCHEDULED` or `DEPLOYING`, the caller receives
a `StoppingSuccess`. The problem is that the stop is never executed, and the
job later switches to `RUNNING`.
The same can be observed if the job is in state `RESTARTING`. Stopping a
restarting job does nothing even though you receive a `StoppingSuccess`
message. The job will later be redeployed.
As a user I would expect the job to be stopped immediately, or at least at
the next possible moment (e.g. once it is deployed). Otherwise I would expect
the system to tell me that stopping is not possible at the moment.
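To make the expectation concrete, here is a minimal sketch of how a stop
request could be answered depending on the job state. All names (`JobState`,
`StopReply`, `StopPolicy`, `deferralSupported`) are hypothetical and only
illustrate the idea; they are not the classes used in this pull request.

```java
import java.util.EnumSet;

// Hypothetical sketch; these names are illustrative and not Flink's classes.
enum JobState { CREATED, SCHEDULED, DEPLOYING, RUNNING, RESTARTING, FINISHED }

enum StopReply { STOPPED_NOW, STOP_DEFERRED, STOP_REJECTED }

final class StopPolicy {
    // States in which the job is not running yet but is expected to run later.
    private static final EnumSet<JobState> NOT_YET_RUNNING =
            EnumSet.of(JobState.SCHEDULED, JobState.DEPLOYING, JobState.RESTARTING);

    // Decide what to tell the caller instead of always reporting success.
    static StopReply onStop(JobState state, boolean deferralSupported) {
        if (state == JobState.RUNNING) {
            return StopReply.STOPPED_NOW;
        }
        if (NOT_YET_RUNNING.contains(state)) {
            // Either remember the request for later or tell the user it failed.
            return deferralSupported ? StopReply.STOP_DEFERRED : StopReply.STOP_REJECTED;
        }
        // Any other state: there is nothing that could be stopped.
        return StopReply.STOP_REJECTED;
    }
}
```

With this shape, a stop during `RESTARTING` would yield either a deferred
stop or an explicit rejection, but never a success message for a stop that is
silently dropped.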
A similar question is what happens if only a subset of the sources is
deployed and in the state `RUNNING`. In that case the undeployed sources
would not be notified of the stop signal and would thus be deployed as
usual.
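One possible way to cover sources that are deployed late is to remember the
stop request at the job level and apply it when a source reaches `RUNNING`.
The following is only a sketch under that assumption; `SketchJob` and
`SketchSource` are made-up names and do not correspond to existing classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these classes exist in Flink under these names.
final class SketchSource {
    private boolean running;
    private boolean stopped;

    void markRunning() { running = true; }
    void stop() { stopped = true; }
    boolean isRunning() { return running; }
    boolean isStopped() { return stopped; }
}

final class SketchJob {
    private final List<SketchSource> sources = new ArrayList<>();
    private boolean stopRequested;

    synchronized SketchSource addSource() {
        SketchSource source = new SketchSource();
        sources.add(source);
        return source;
    }

    // Remember the request and stop every source that is already running.
    synchronized void stop() {
        stopRequested = true;
        for (SketchSource source : sources) {
            if (source.isRunning()) {
                source.stop();
            }
        }
    }

    // Called when a source reaches RUNNING; a pending stop is applied here.
    synchronized void onSourceRunning(SketchSource source) {
        source.markRunning();
        if (stopRequested) {
            source.stop();
        }
    }
}
```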
Furthermore, what happens if the `stop` method of the `SourceFunction`
throws an unchecked exception? If I'm not mistaken, then this will only get
logged. But shouldn't the task be cancelled in such a situation because the
state cannot be guaranteed to be consistent anymore?
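As a sketch of what I have in mind, the call to `stop` could be wrapped so
that an unchecked exception leads to cancellation instead of only a log
entry. `StoppableSource`, `StopOutcome`, `StopInvoker` and the `cancelTask`
callback are hypothetical stand-ins, not the actual types in this pull
request.

```java
// Hypothetical sketch; the interface and class names are illustrative only.
interface StoppableSource {
    void stop();
}

enum StopOutcome { STOPPED, CANCELLED }

final class StopInvoker {
    // Invoke stop(); if it throws an unchecked exception, cancel the task
    // because its state can no longer be assumed to be consistent.
    static StopOutcome stopOrCancel(StoppableSource source, Runnable cancelTask) {
        try {
            source.stop();
            return StopOutcome.STOPPED;
        } catch (RuntimeException e) {
            cancelTask.run();
            return StopOutcome.CANCELLED;
        }
    }
}
```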
The cases that a `Task` is not `Stoppable` and that a `Task` cannot be found
on the `TaskManager` are treated identically by the `Execution`. Both cause a
`TaskOperationResult(executionID, false, message)` to be sent back to the
`Execution`, where it is logged that the stopping call "did not find the
task". I think it would be good to differentiate the two cases.
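For illustration, the two failure cases could be reported with distinct
result values rather than a single `false` flag. This is a sketch with
made-up names (`StopResult`, `SketchTaskManager`, `SketchTask`,
`SketchStoppableTask`); it does not reflect the real `TaskOperationResult`
or `TaskManager` code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; these types only mimic the roles discussed above.
interface SketchTask {}

interface SketchStoppableTask extends SketchTask {
    void stop();
}

enum StopResult { STOPPED, TASK_NOT_FOUND, TASK_NOT_STOPPABLE }

final class SketchTaskManager {
    private final Map<String, SketchTask> tasks = new HashMap<>();

    void register(String executionId, SketchTask task) {
        tasks.put(executionId, task);
    }

    // Report why a stop call failed instead of collapsing both cases.
    StopResult stopTask(String executionId) {
        SketchTask task = tasks.get(executionId);
        if (task == null) {
            return StopResult.TASK_NOT_FOUND;
        }
        if (!(task instanceof SketchStoppableTask)) {
            return StopResult.TASK_NOT_STOPPABLE;
        }
        ((SketchStoppableTask) task).stop();
        return StopResult.STOPPED;
    }
}
```

The receiving side could then log a message that matches the actual cause.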