[jira] [Updated] (FLINK-8559) Exceptions in RocksDBIncrementalSnapshotOperation#takeSnapshot cause job to get stuck

Chesnay Schepler (JIRA) Mon, 05 Feb 2018 03:37:30 -0800

     [ https://issues.apache.org/jira/browse/FLINK-8559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chesnay Schepler updated FLINK-8559:
------------------------------------
    Description: 
{{RocksDBStatebackend#snapshotIncrementally}} contains the following code:
 
{code:java}
// The constructor acquires a resource from the RocksDB ResourceGuard.
final RocksDBIncrementalSnapshotOperation<K> snapshotOperation =
    new RocksDBIncrementalSnapshotOperation<>(
        this,
        checkpointStreamFactory,
        checkpointId,
        checkpointTimestamp);

// If this call throws, the FutureTask below is never created,
// so releaseResources() in done() is never invoked.
snapshotOperation.takeSnapshot();

return new FutureTask<KeyedStateHandle>(
    new Callable<KeyedStateHandle>() {
        @Override
        public KeyedStateHandle call() throws Exception {
            return snapshotOperation.materializeSnapshot();
        }
    }
) {
    @Override
    public boolean cancel(boolean mayInterruptIfRunning) {
        snapshotOperation.stop();
        return super.cancel(mayInterruptIfRunning);
    }

    @Override
    protected void done() {
        snapshotOperation.releaseResources(isCancelled());
    }
};
{code}

The constructor of {{RocksDBIncrementalSnapshotOperation}} calls 
{{acquireResource()}} on the RocksDB {{ResourceGuard}}. If 
{{snapshotOperation.takeSnapshot()}} fails with an exception, these resources 
are never released. When the task is then shut down because of the exception, 
it gets stuck releasing RocksDB.
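
One possible way to avoid the leak is to release the resources when {{takeSnapshot()}} throws, before the exception propagates. The following is only a minimal, self-contained sketch of that idea and not the actual Flink code or fix: {{Guard}}, {{SnapshotOperation}} and the {{releaseOnFailure}} flag are hypothetical stand-ins that merely count open leases, so the two behaviours can be compared side by side.

```java
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicInteger;

class SnapshotLeakSketch {

    /** Hypothetical stand-in for the RocksDB ResourceGuard; it only counts open leases. */
    static final class Guard {
        final AtomicInteger openLeases = new AtomicInteger();
        void acquire() { openLeases.incrementAndGet(); }
        void release() { openLeases.decrementAndGet(); }
    }

    /** Hypothetical stand-in for RocksDBIncrementalSnapshotOperation. */
    static final class SnapshotOperation {
        private final Guard guard;
        private final boolean failInTakeSnapshot;

        SnapshotOperation(Guard guard, boolean failInTakeSnapshot) {
            this.guard = guard;
            this.failInTakeSnapshot = failInTakeSnapshot;
            guard.acquire(); // the constructor takes the lease
        }

        void takeSnapshot() throws Exception {
            if (failInTakeSnapshot) {
                throw new Exception("simulated snapshot failure");
            }
        }

        String materializeSnapshot() { return "state-handle"; }

        void releaseResources() { guard.release(); }
    }

    /** Mirrors the shape of the snippet above; releaseOnFailure toggles the sketched fix. */
    static FutureTask<String> snapshot(Guard guard, boolean fail, boolean releaseOnFailure)
            throws Exception {
        final SnapshotOperation op = new SnapshotOperation(guard, fail);
        try {
            op.takeSnapshot();
        } catch (Exception e) {
            if (releaseOnFailure) {
                op.releaseResources(); // sketched fix: give the lease back before rethrowing
            }
            throw e;
        }
        return new FutureTask<String>(op::materializeSnapshot) {
            @Override
            protected void done() {
                op.releaseResources(); // normal path: released once the task finishes
            }
        };
    }

    /** Returns the number of leases still open after a failing takeSnapshot(). */
    static int openLeasesAfterFailure(boolean releaseOnFailure) {
        Guard guard = new Guard();
        try {
            snapshot(guard, true, releaseOnFailure);
        } catch (Exception expected) {
            // the failure itself is expected; only the lease count matters here
        }
        return guard.openLeases.get();
    }

    /** Returns the number of leases still open after a successful run of the task. */
    static int openLeasesAfterSuccess() throws Exception {
        Guard guard = new Guard();
        FutureTask<String> task = snapshot(guard, false, false);
        task.run();
        task.get();
        return guard.openLeases.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("without release on failure: " + openLeasesAfterFailure(false));
        System.out.println("with release on failure:    " + openLeasesAfterFailure(true));
        System.out.println("after successful run:       " + openLeasesAfterSuccess());
    }
}
```

In this sketch, a lease left open after the failure corresponds to the resource that the task shutdown would later wait on. Whether the real fix should release inside {{takeSnapshot()}}, in the caller, or by moving the acquisition out of the constructor is a design question this sketch does not settle.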

  was:
The {{testCheckpointedStreamingProgramIncrementalRocksDB}} test in 
{{JobManagerHACheckpointRecoveryITCase}} runs indefinitely on Windows.

The snapshotting fails for one of the two tasks due to FLINK-8557, but the job 
never enters a failure state. The task shutdown is stuck on releasing the 
RocksDB resources.


> Exceptions in RocksDBIncrementalSnapshotOperation#takeSnapshot cause job to 
> get stuck
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-8559
>                 URL: https://issues.apache.org/jira/browse/FLINK-8559
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing, Tests
>    Affects Versions: 1.5.0
>            Reporter: Chesnay Schepler
>            Priority: Blocker



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
