[jira] [Commented] (FLINK-7757) RocksDB lock is too strict and can block snapshots in synchronous phase

ASF GitHub Bot (JIRA) Tue, 10 Oct 2017 03:24:29 -0700

    [ https://issues.apache.org/jira/browse/FLINK-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198477#comment-16198477 ]


ASF GitHub Bot commented on FLINK-7757:
---------------------------------------

Github user aljoscha commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4764#discussion_r143685204
  
    --- Diff: flink-contrib/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBKeyedStateBackend.java ---
    @@ -618,19 +600,8 @@ public void releaseSnapshotResources() {
                                IOUtils.closeQuietly(readOptions);
                                readOptions = null;
                        }
    -           }
     
    -           /**
    -            * Drop the created snapshot if we have ben cancelled.
    -            */
    -           public void dropSnapshotResult() {
    -                   if (null != snapshotResultStateHandle) {
    -                           try {
    -                                   snapshotResultStateHandle.discardState();
    --- End diff --
    
    Cleanup is now handled somewhere else?


> RocksDB lock is too strict and can block snapshots in synchronous phase
> -----------------------------------------------------------------------
>
>                 Key: FLINK-7757
>                 URL: https://issues.apache.org/jira/browse/FLINK-7757
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.2, 1.3.2
>            Reporter: Stefan Richter
>            Assignee: Stefan Richter
>            Priority: Blocker
>             Fix For: 1.4.0
>
>
> {{RocksDBKeyedStateBackend}} uses a lock to guard the db instance against 
> disposal of the native resources while parallel threads might still 
> access the db, which could otherwise lead to segfaults.
> Unfortunately, this locking is a bit too strict and can lead to situations 
> where snapshots block the pipeline. This can happen when a snapshot s1 is 
> running and blocked somewhere in IO while holding the guarding lock. A 
> second snapshot s2 can be triggered in parallel and needs to hold the lock 
> in the synchronous part to get a snapshot from the db. As s1 is still holding 
> the lock, s2 can block here and stop the operator from processing further 
> elements.
> A simple solution could remove lock acquisition from the synchronous phase, 
> because both the synchronous phase and the disposal of the backend may only 
> be triggered from the thread that also drives element processing.
> A better solution would be to remove long sections under the lock 
> altogether, because as of now they always prevent parallel checkpointing. 
> I think a guard for the rocksdb instance would be sufficient, one that 
> blocks disposal for as long as there are still clients potentially 
> accessing the instance in parallel. This could be realized by keeping a 
> synchronized counter of active clients and blocking disposal until the 
> client count drops to zero.
> This approach could also be integrated with triggering timers, which have 
> always been problematic in the disposal phase and are currently 
> unregulated. In the new model, they could register as yet another client.
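
The client-counting guard proposed in the description could look roughly like the sketch below. It is only an illustration of the idea, not the code from the pull request: the class name ClientCountingGuard, its Lease type, and all method names are hypothetical. Clients (a snapshot thread, a timer) register before touching the db and release when done; disposal rejects new clients and waits until the count reaches zero.

```java
/** Hypothetical guard: counts active clients and blocks disposal until none remain. */
final class ClientCountingGuard {

    private final Object lock = new Object();
    private int activeClients = 0;
    private boolean closed = false;

    /** Registers a client. Fails once disposal has started. */
    Lease acquire() {
        synchronized (lock) {
            if (closed) {
                throw new IllegalStateException("Resource is already closed.");
            }
            activeClients++;
        }
        return new Lease();
    }

    /** Rejects new clients, then blocks until every lease has been released. */
    void closeAndAwaitClients() throws InterruptedException {
        synchronized (lock) {
            closed = true;
            while (activeClients > 0) {
                lock.wait();
            }
        }
    }

    /** Current number of registered clients (for illustration and testing). */
    int activeClients() {
        synchronized (lock) {
            return activeClients;
        }
    }

    /** Handle held by a client while it may access the guarded resource. */
    final class Lease implements AutoCloseable {

        private boolean released = false;

        /** Releases the lease; repeated calls have no further effect. */
        @Override
        public void close() {
            synchronized (lock) {
                if (released) {
                    return;
                }
                released = true;
                activeClients--;
                if (activeClients == 0) {
                    // Wake up a thread waiting in closeAndAwaitClients().
                    lock.notifyAll();
                }
            }
        }
    }
}
```

In this sketch the lock is held only while the counter is updated, never across IO, so a snapshot that is stuck in IO delays disposal but does not block a second snapshot from registering. A client would typically use try-with-resources, e.g. `try (ClientCountingGuard.Lease lease = guard.acquire()) { ... }`, so the lease is released on every exit path.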



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
