[ 
https://issues.apache.org/jira/browse/FLINK-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15008719#comment-15008719
 ] 

ASF GitHub Bot commented on FLINK-3011:
---------------------------------------

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/1369

    [FLINK-3011, 3019, 3028] Cancel jobs in RESTARTING state

    This addresses issues with cancelling jobs that are in the `RESTARTING` 
state. A job enters this state after a failure, as soon as all job vertices are 
in their final state. It then stays in this state until it is redeployed (the 
default delay is currently 100s). In this state, the job cannot be cancelled. 
If the failure is permanent (for example, missing slots), the job can never be 
cancelled.
    
    This PR includes changes to the ExecutionGraph and to the clients:
    
    **ExecutionGraph** (FLINK-3011)
    - Remove the state transition from `FAILED` to `RESTARTING` in `restart()`. 
This was breaking the semantics of `FAILED` being a terminal state. It was only 
relevant for a test as far as I can tell.
    - When cancelling during restarts, two job states are relevant:
      - `RESTARTING`: try to set the state directly to `CANCELED`, as all 
vertices have already failed when the job enters the `RESTARTING` state. 
If the state transition to `CANCELED` succeeds, the restart will be ignored 
with a log message.
      - `FAILING`: try to set the state to `CANCELLING` and wait for the 
vertices to finish failing. This will finish the cancellation as usual 
in `jobVertexInFinalState()`. 
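    
    The two cases above can be sketched as compare-and-set transitions on the job state. This is a minimal, self-contained illustration and not the actual `ExecutionGraph` code: the class `CancelDuringRestartSketch`, the reduced `JobStatus` enum, and the use of `CREATED` as a stand-in for redeployment are assumptions made for the example.
    
    ```java
    import java.util.concurrent.atomic.AtomicReference;
    
    // Hypothetical, simplified stand-in for the job state handling described above.
    class CancelDuringRestartSketch {
    
        // Only the job states needed for this sketch.
        enum JobStatus { CREATED, FAILING, RESTARTING, CANCELLING, CANCELED }
    
        private final AtomicReference<JobStatus> state;
    
        CancelDuringRestartSketch(JobStatus initial) {
            this.state = new AtomicReference<>(initial);
        }
    
        JobStatus getState() {
            return state.get();
        }
    
        void cancel() {
            while (true) {
                JobStatus current = state.get();
                if (current == JobStatus.RESTARTING) {
                    // All vertices have already failed, so go straight to CANCELED.
                    if (state.compareAndSet(current, JobStatus.CANCELED)) {
                        return;
                    }
                } else if (current == JobStatus.FAILING) {
                    // Vertices are still failing; cancellation completes in
                    // jobVertexInFinalState().
                    if (state.compareAndSet(current, JobStatus.CANCELLING)) {
                        return;
                    }
                } else {
                    return; // Other states are out of scope for this sketch.
                }
                // The compare-and-set lost a race: re-read the state and retry.
            }
        }
    
        // Called once all job vertices have reached a final state.
        void jobVertexInFinalState() {
            if (!state.compareAndSet(JobStatus.FAILING, JobStatus.RESTARTING)) {
                state.compareAndSet(JobStatus.CANCELLING, JobStatus.CANCELED);
            }
        }
    
        // Returns false (restart ignored) unless the job is still RESTARTING.
        boolean restart() {
            if (state.compareAndSet(JobStatus.RESTARTING, JobStatus.CREATED)) {
                return true; // Redeployment would be triggered here.
            }
            System.out.println("Ignoring restart, job state is " + state.get());
            return false;
        }
    
        public static void main(String[] args) {
            // Cancel while RESTARTING: the later restart attempt is ignored.
            CancelDuringRestartSketch restarting =
                    new CancelDuringRestartSketch(JobStatus.RESTARTING);
            restarting.cancel();
            restarting.restart();
            System.out.println(restarting.getState());
    
            // Cancel while FAILING: finishes once the vertices are final.
            CancelDuringRestartSketch failing =
                    new CancelDuringRestartSketch(JobStatus.FAILING);
            failing.cancel();
            failing.jobVertexInFinalState();
            System.out.println(failing.getState());
        }
    }
    ```
    
    In this sketch, whichever of `cancel()` and `restart()` wins the compare-and-set on `RESTARTING` decides the outcome, so a restart that fires after a successful cancel has nothing left to do.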
    
    When reviewing, the `cancel()`, `jobVertexInFinalState()`, and `restart()` 
methods are relevant.
    
    **CLIFrontend** (FLINK-3019)
    - List restarting jobs with scheduled jobs
    
    ```
    $ bin/flink list
    No running jobs.
    ---------------- Scheduled/Restarting Jobs -------------------
    17.11.2015 15:14:01 : 4b3fa06c88e5a2a4963241e7afca7b7d : Streaming 
WordCount (RESTARTING)
    --------------------------------------------------------------
    ```
    
    **WebFrontend** (FLINK-3028)
    - Show the cancel button if the job is restarting. Previously, it was only 
displayed for running or created jobs.
    
    ---
    
    I want to merge this for 0.10.1 and 1.0.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 3011-restart

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1369.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1369
    
----
commit 0c5a3306808bec5b9a833703adbcd9f45bbe6de5
Author: Ufuk Celebi <[email protected]>
Date:   2015-11-16T15:18:20Z

    [FLINK-3011] [runtime] Disallow ExecutionGraph state transition from FAILED 
to RESTARTING
    
    Removes the possibility to go from FAILED state back to RESTARTING. This 
was only used in a test
    case. It was breaking the terminal state semantics of the FAILED state.

commit 19c602b2ce7686237d8611645a4662aa2b2a0cef
Author: Ufuk Celebi <[email protected]>
Date:   2015-11-17T10:40:54Z

    [FLINK-3011] [runtime, tests] Translate ExecutionGraphRestartTest to Java

commit e13dd1bac7029af6ae4157af226131a10f5d02d0
Author: Ufuk Celebi <[email protected]>
Date:   2015-11-17T10:56:42Z

    [FLINK-3011] [runtime] Fix cancel during restart

commit 657e34f31fe9c6325900f42c36257b5c5d2019be
Author: Ufuk Celebi <[email protected]>
Date:   2015-11-17T13:11:44Z

    [FLINK-3019] [client] List restarting jobs with scheduled jobs

commit 8b2850610aff1197d204bdb7d790df8fb6b5df4c
Author: Ufuk Celebi <[email protected]>
Date:   2015-11-17T13:51:15Z

    [FLINK-3028] [runtime-web] Show cancel button for restarting jobs

----


> Cannot cancel failing/restarting streaming job from the command line
> --------------------------------------------------------------------
>
>                 Key: FLINK-3011
>                 URL: https://issues.apache.org/jira/browse/FLINK-3011
>             Project: Flink
>          Issue Type: Bug
>          Components: Command-line client
>    Affects Versions: 0.10.0, 1.0.0
>            Reporter: Gyula Fora
>            Assignee: Ufuk Celebi
>            Priority: Critical
>
> I cannot cancel a failing/restarting job from the command 
> line client. The job cannot be rescheduled, so it keeps failing.
> The log output I get:
> 13:58:11,240 INFO  org.apache.flink.runtime.jobmanager.JobManager             
>    - Status of job 0c895d22c632de5dfe16c42a9ba818d5 (player-id) changed to 
> RESTARTING.
> 13:58:25,234 INFO  org.apache.flink.runtime.jobmanager.JobManager             
>    - Trying to cancel job with ID 0c895d22c632de5dfe16c42a9ba818d5.
> 13:58:25,561 WARN  akka.remote.ReliableDeliverySupervisor                     
>    - Association with remote system [akka.tcp://[email protected]:42012] has 
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].



