[jira] [Commented] (FLINK-7757) RocksDB lock is too strict and can block snapshots in synchronous phase

ASF GitHub Bot (JIRA) Wed, 18 Oct 2017 01:09:35 -0700

    [ https://issues.apache.org/jira/browse/FLINK-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208953#comment-16208953 ]


ASF GitHub Bot commented on FLINK-7757:
---------------------------------------

Github user StefanRRichter commented on a diff in the pull request:

    https://github.com/apache/flink/pull/4764#discussion_r145341988
  
    --- Diff: flink-core/src/test/java/org/apache/flink/util/ResourceGuardTest.java ---
    @@ -0,0 +1,135 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.util;
    +
    +import org.junit.Assert;
    +import org.junit.Test;
    +
    +import java.io.IOException;
    +import java.util.concurrent.atomic.AtomicBoolean;
    +
    +public class ResourceGuardTest {
    +
    +   @Test
    +   public void testClose() {
    +           ResourceGuard resourceGuard = new ResourceGuard();
    +           Assert.assertFalse(resourceGuard.isClosed());
    +           resourceGuard.close();
    +           Assert.assertTrue(resourceGuard.isClosed());
    +           try {
    +                   resourceGuard.acquireResource();
    +                   Assert.fail();
    +           } catch (IOException ignore) {
    +           }
    +   }
    +
    +   @Test
    +   public void testAcquireReleaseClose() throws IOException {
    +           ResourceGuard resourceGuard = new ResourceGuard();
    +           ResourceGuard.Lease lease = resourceGuard.acquireResource();
    +           Assert.assertEquals(1, resourceGuard.getLeaseCount());
    +           lease.close();
    +           Assert.assertEquals(0, resourceGuard.getLeaseCount());
    +           resourceGuard.close();
    +           Assert.assertTrue(resourceGuard.isClosed());
    +   }
    +
    +   @Test
    +   public void testCloseBlockIfAcquired() throws Exception {
    +           ResourceGuard resourceGuard = new ResourceGuard();
    +           ResourceGuard.Lease lease_1 = resourceGuard.acquireResource();
    +           AtomicBoolean checker = new AtomicBoolean(true);
    +
    +           Thread closerThread = new Thread() {
    +                   @Override
    +                   public void run() {
    +                           try {
    +                                   // this line should block until all acquires are matched by releases.
    +                                   resourceGuard.close();
    +                                   checker.set(false);
    +                           } catch (Exception ignore) {
    +                                   checker.set(false);
    +                           }
    +                   }
    +           };
    +
    +           closerThread.start();
    +
    +           ResourceGuard.Lease lease_2 = resourceGuard.acquireResource();
    +           lease_2.close();
    +           Thread.sleep(50);
    --- End diff --
    
    I don't like them either. I will drop them because they are not actually
    adding real value and the test is clearer without them.


> RocksDB lock is too strict and can block snapshots in synchronous phase
> -----------------------------------------------------------------------
>
>                 Key: FLINK-7757
>                 URL: https://issues.apache.org/jira/browse/FLINK-7757
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.2, 1.3.2
>            Reporter: Stefan Richter
>            Assignee: Stefan Richter
>            Priority: Blocker
>             Fix For: 1.4.0
>
>
> {{RocksDBKeyedStateBackend}} uses a lock to guard the db instance against 
> disposal of the native resources while some parallel threads might still 
> access db, which might otherwise lead to segfaults.
> Unfortunately, this locking is a bit too strict and can lead to situations 
> where snapshots block the pipeline. This can happen when a snapshot s1 is 
> running and is blocked somewhere in IO while holding the guarding lock. A 
> second snapshot s2 can be triggered in parallel and needs to hold the lock 
> in the synchronous part to get a snapshot from db. As s1 is still holding on 
> to the lock, s2 can block here and stop the operator from processing further 
> elements.
> A simple solution could remove lock acquisition from the synchronous phase, 
> because both the synchronous phase and disposal of the backend are only 
> allowed to be triggered from the thread that also drives element processing.
> A better solution would be to remove long sections under the lock 
> altogether, because as of now they will always prevent the possibility of 
> parallel checkpointing. I think a guard for the RocksDB instance that blocks 
> disposal for as long as there are still clients potentially accessing the 
> instance in parallel would be sufficient. This could be realized by keeping 
> a synchronized counter for active clients and blocking disposal until the 
> client count drops to zero.
> This approach could also be integrated with triggering timers, which have 
> always been problematic in the disposal phase and are currently unregulated. 
> In the new model, they could register as yet another client.
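
The counter-based guard proposed in the description could look roughly like
the sketch below. It is only an illustration of the idea and not the
ResourceGuard from the pull request: the class name ClientCountGuard, its
method names, and the choice of a monitor with wait/notifyAll are all
assumptions made for this example.

```java
import java.io.IOException;

/**
 * Hypothetical sketch of a client-counting guard. Names and details are
 * assumptions for illustration, not Flink's actual implementation.
 */
public class ClientCountGuard implements AutoCloseable {

    private final Object lock = new Object();
    private int clientCount = 0;
    private boolean closed = false;

    /** Registers a client. Fails once disposal has started. */
    public Lease acquire() throws IOException {
        synchronized (lock) {
            if (closed) {
                throw new IOException("Guard is already closed.");
            }
            clientCount++;
            return new Lease();
        }
    }

    public int getClientCount() {
        synchronized (lock) {
            return clientCount;
        }
    }

    public boolean isClosed() {
        synchronized (lock) {
            return closed;
        }
    }

    /** Rejects new clients, then blocks until the client count is zero. */
    @Override
    public void close() {
        synchronized (lock) {
            closed = true;
            boolean interrupted = false;
            while (clientCount > 0) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    // Keep waiting; restore the interrupt flag afterwards.
                    interrupted = true;
                }
            }
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Handle for one registered client. Closing it twice is a no-op. */
    public final class Lease implements AutoCloseable {

        private boolean released = false;

        @Override
        public void close() {
            synchronized (lock) {
                if (!released) {
                    released = true;
                    clientCount--;
                    lock.notifyAll();
                }
            }
        }
    }
}
```

A snapshot thread (or a timer, as suggested above) would wrap its access in
try-with-resources on acquire(), so the monitor is only held for the short
counter update and never across IO. Disposal calls close() on the guard and
frees the native resources only after it returns.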



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
