[ https://issues.apache.org/jira/browse/FLINK-7757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208953#comment-16208953 ]
ASF GitHub Bot commented on FLINK-7757:
---------------------------------------
Github user StefanRRichter commented on a diff in the pull request:
https://github.com/apache/flink/pull/4764#discussion_r145341988
--- Diff: flink-core/src/test/java/org/apache/flink/util/ResourceGuardTest.java ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.util;
+
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicBoolean;
+
+public class ResourceGuardTest {
+
+    @Test
+    public void testClose() {
+        ResourceGuard resourceGuard = new ResourceGuard();
+        Assert.assertFalse(resourceGuard.isClosed());
+        resourceGuard.close();
+        Assert.assertTrue(resourceGuard.isClosed());
+        try {
+            resourceGuard.acquireResource();
+            Assert.fail();
+        } catch (IOException ignore) {
+        }
+    }
+
+    @Test
+    public void testAcquireReleaseClose() throws IOException {
+        ResourceGuard resourceGuard = new ResourceGuard();
+        ResourceGuard.Lease lease = resourceGuard.acquireResource();
+        Assert.assertEquals(1, resourceGuard.getLeaseCount());
+        lease.close();
+        Assert.assertEquals(0, resourceGuard.getLeaseCount());
+        resourceGuard.close();
+        Assert.assertTrue(resourceGuard.isClosed());
+    }
+
+    @Test
+    public void testCloseBlockIfAcquired() throws Exception {
+        ResourceGuard resourceGuard = new ResourceGuard();
+        ResourceGuard.Lease lease_1 = resourceGuard.acquireResource();
+        AtomicBoolean checker = new AtomicBoolean(true);
+
+        Thread closerThread = new Thread() {
+            @Override
+            public void run() {
+                try {
+                    // this line should block until all acquires are matched by releases.
+                    resourceGuard.close();
+                    checker.set(false);
+                } catch (Exception ignore) {
+                    checker.set(false);
+                }
+            }
+        };
+
+        closerThread.start();
+
+        ResourceGuard.Lease lease_2 = resourceGuard.acquireResource();
+        lease_2.close();
+        Thread.sleep(50);
--- End diff --
I don't like them either. I will drop them because they are not actually
adding real value and the test is clearer without them.
> RocksDB lock is too strict and can block snapshots in synchronous phase
> -----------------------------------------------------------------------
>
> Key: FLINK-7757
> URL: https://issues.apache.org/jira/browse/FLINK-7757
> Project: Flink
> Issue Type: Bug
> Components: State Backends, Checkpointing
> Affects Versions: 1.2.2, 1.3.2
> Reporter: Stefan Richter
> Assignee: Stefan Richter
> Priority: Blocker
> Fix For: 1.4.0
>
>
> {{RocksDBKeyedStateBackend}} uses a lock to guard the db instance against
> disposal of the native resources while some parallel threads might still
> access db, which might otherwise lead to segfaults.
> Unfortunately, this locking is a bit too strict and can lead to situations
> where snapshots block the pipeline. This can happen when a snapshot s1 is
> running and blocked somewhere in IO while holding the guarding lock. A
> second snapshot s2 can be triggered in parallel and needs to hold the lock
> in the synchronous part to get a snapshot from db. As s1 is still holding on
> to the lock, s2 can block here and stop the operator from processing further
> elements.
> A simple solution could remove lock acquisition from the synchronous phase,
> because both the synchronous phase and the disposal of the backend are only
> allowed to be triggered from the thread that also drives element processing.
> A better solution would be to remove long sections under the lock
> altogether, because as of now they will always prevent the possibility of
> parallel checkpointing. I think a guard for the rocksdb instance would be
> sufficient that blocks disposal for as long as there are still clients
> potentially accessing the instance in parallel. This could be realized by
> keeping a synchronized counter for active clients and blocking disposal until
> the client count drops to zero.
> This approach could also be integrated with triggering timers, which have
> always been problematic in the disposal phase and are currently unregulated.
> In the new model, they could register as yet another client.
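
The counter-based guard proposed in the description can be sketched roughly as follows. This is only an illustration of the idea, not the ResourceGuard class from the pull request: the class name ClientCountGuard and its methods acquire(), getClientCount() and isClosed() are hypothetical names chosen for this example. It assumes that disposal should reject new clients as soon as it starts and then wait until every registered client has released its lease.

```java
import java.io.IOException;

/** Hypothetical sketch of a client-counting guard; not Flink's actual implementation. */
final class ClientCountGuard implements AutoCloseable {

    private final Object lock = new Object();
    private int clientCount; // guarded by lock
    private boolean closed;  // guarded by lock

    /** Registers a client. Fails once close() has been requested. */
    public Lease acquire() throws IOException {
        synchronized (lock) {
            if (closed) {
                throw new IOException("Guard is already closed.");
            }
            ++clientCount;
        }
        return new Lease();
    }

    /** Rejects new clients, then blocks until the client count drops to zero. */
    @Override
    public void close() {
        synchronized (lock) {
            closed = true;
            boolean interrupted = false;
            while (clientCount > 0) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    // Keep waiting so the resource is never disposed under a client.
                    interrupted = true;
                }
            }
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public int getClientCount() {
        synchronized (lock) {
            return clientCount;
        }
    }

    public boolean isClosed() {
        synchronized (lock) {
            return closed;
        }
    }

    /** Handle held by one client; releasing it twice has no further effect. */
    public final class Lease implements AutoCloseable {

        private boolean released; // guarded by lock

        @Override
        public void close() {
            synchronized (lock) {
                if (!released) {
                    released = true;
                    if (--clientCount == 0) {
                        lock.notifyAll(); // wake up a thread blocked in ClientCountGuard.close()
                    }
                }
            }
        }
    }
}
```

A client such as an asynchronous snapshot or a timer callback would hold a lease only while it touches the native resource, for example with try (ClientCountGuard.Lease lease = guard.acquire()) { ... }, so that no long section runs under a lock and parallel checkpoints do not block each other.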