GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/2252

    [FLINK-3466] [runtime] Cancel state handles on state restore

    This pull request fixes the issue that state restore operations can get 
stuck when tasks are cancelled during state restore. That happens due to a bug 
in HDFS, which deadlocks (or livelocks) when the reading thread is interrupted.
    
    This introduces two things:
    
      1. All state handles and key/value snapshots are now `Closeable`. Closing 
does not delete any checkpoint data; it only closes pending streams and data 
fetch handles. Operations that concurrently access the state behind a closed 
handle should fail.
    
      2. The `StreamTask` holds a set of "Closeables" that it closes upon 
cancellation. This is a cleaner way of stopping in-progress work than relying 
on "interrupt()" to interrupt that work.
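
As a rough illustration of point 2, a set of closeables owned by the task could look like the sketch below. The class name `CloseableRegistry` and its methods are assumptions made for this example and may not match the names or behavior in the patch.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical registry of resources to close on cancellation (illustrative only). */
final class CloseableRegistry implements Closeable {
    private final Set<Closeable> closeables = new HashSet<>();
    private boolean closed;

    /**
     * Registers a resource. If the registry was already closed (task cancelled),
     * the resource is closed immediately and false is returned.
     */
    public boolean register(Closeable resource) {
        synchronized (this) {
            if (!closed) {
                closeables.add(resource);
                return true;
            }
        }
        closeQuietly(resource);
        return false;
    }

    /** Removes a resource once its work has finished normally. */
    public synchronized void unregister(Closeable resource) {
        closeables.remove(resource);
    }

    /** Called on cancellation: closes everything that is still registered. */
    @Override
    public void close() {
        Set<Closeable> toClose;
        synchronized (this) {
            if (closed) {
                return;
            }
            closed = true;
            toClose = new HashSet<>(closeables);
            closeables.clear();
        }
        // Close outside the lock so a slow close() cannot block register().
        for (Closeable resource : toClose) {
            closeQuietly(resource);
        }
    }

    private static void closeQuietly(Closeable resource) {
        try {
            resource.close();
        } catch (IOException ignored) {
            // Best effort: cancellation should proceed even if one close fails.
        }
    }
}
```

In this sketch, a restore operation would register its state handle before reading and unregister it afterwards; the cancelling thread only calls `close()` on the registry.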
    
    This mechanism should eventually be extended to also cancel operators and 
state handles pending asynchronous materialization.
    
    There is a test that has an interrupt sensitive state handle (mimicking 
HDFS's deadlock behavior) that causes a stall without this pull request and 
cleanly finishes with the changes in this pull request.
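
The idea behind such a test can be sketched as follows. This is not the test from the pull request: `InterruptIgnoringStream` and `RestoreCancelDemo` are hypothetical names, and the stream only imitates a read that does not react to interrupts and returns once the stream is closed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.CountDownLatch;

/** Hypothetical stream whose read() ignores interrupts and only ends on close(). */
final class InterruptIgnoringStream extends InputStream {
    private final CountDownLatch closedLatch = new CountDownLatch(1);

    @Override
    public int read() throws IOException {
        boolean interrupted = false;
        while (true) {
            try {
                closedLatch.await();
                break;
            } catch (InterruptedException e) {
                interrupted = true; // swallow the interrupt and keep blocking
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore the flag for the caller
        }
        throw new IOException("stream closed");
    }

    @Override
    public void close() {
        closedLatch.countDown();
    }
}

final class RestoreCancelDemo {
    /** Returns true if the reader survives interrupt() and stops after close(). */
    static boolean readerStopsOnlyOnClose() {
        InterruptIgnoringStream stream = new InterruptIgnoringStream();
        Thread reader = new Thread(() -> {
            try {
                stream.read();
            } catch (IOException expected) {
                // close() unblocked the read
            }
        });
        reader.start();
        try {
            reader.interrupt();
            reader.join(200); // the interrupt alone must not end the read
            boolean survivedInterrupt = reader.isAlive();
            stream.close();
            reader.join(5000);
            return survivedInterrupt && !reader.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            stream.close();
            return false;
        }
    }
}
```

Interrupting the reader thread leaves it blocked, while closing the stream lets it finish, which is the behavior the cancellation path relies on.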

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink state_handle_cancellation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2252.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2252
    
----
commit 224503b86c2864f604a7c519ea5f415c57f35ff3
Author: Stephan Ewen <[email protected]>
Date:   2016-07-14T13:14:12Z

    [FLINK-3466] [tests] Add serialization validation for state handles

commit c411b379381ab1390e2166356232a33165c1abd9
Author: Stephan Ewen <[email protected]>
Date:   2016-07-13T19:32:40Z

    [FLINK-3466] [runtime] Make state handles cancelable.
    
    State handles are cancelable, to make sure long running checkpoint restore operations
    finish early on cancellation, even if the code does not properly react to interrupts.
    
    This is especially important since HDFS client code is so buggy that it deadlocks when
    interrupted without closing.

----


