keith-turner commented on a change in pull request #2329:
URL: https://github.com/apache/accumulo/pull/2329#discussion_r740260185



##########
File path: 
core/src/main/java/org/apache/accumulo/fate/zookeeper/DistributedReadWriteLock.java
##########
@@ -218,22 +237,54 @@ public boolean tryLock() {
       }
       SortedMap<Long,byte[]> entries = qlock.getEarlierEntries(entry);
       Iterator<Entry<Long,byte[]>> iterator = entries.entrySet().iterator();
-      if (!iterator.hasNext())
+      if (!iterator.hasNext()) {
         throw new IllegalStateException("Did not find our own lock in the 
queue: " + this.entry
             + " userData " + new String(this.userData, UTF_8) + " lockType " + 
lockType());
-      return iterator.next().getKey().equals(entry);
+      }
+      if (!failBlockers) {
+        return iterator.next().getKey().equals(entry);
+      } else {
+        ZooStore<DistributedReadWriteLock> zs;
+        try {
+          zs = new ZooStore<>(zooPath, zrw);
+        } catch (KeeperException | InterruptedException e1) {
+          log.error("Error creating zoo store", e1);
+          return false;
+        }
+        final AdminUtil<DistributedReadWriteLock> util = new AdminUtil<>();
+        boolean result = true;
+        while (iterator.hasNext()) {
+          Entry<Long,byte[]> e = iterator.next();
+          if (!e.getKey().equals(entry)) {
+            result &= util.prepFail(zs, zrw, zooManagerPath, 
Long.toString(e.getKey(), 16));

Review comment:
       I see two problems with the call here.  First prepFail() calls  
checkGlobalLock() which calls System.exit() when the constuctor AdminUtil<>() 
is called.  The reason it calles checkGlobalLock is because I think this code 
was only intended to be called when the manager was not running and no repos 
were running.  The Second problem is that prepFail only initiates failure, it 
does not wait for any operations that may currently be actively executing and 
holding the lock to finish.  This is why prepFail checks that the Manager is 
not running, because it does not deal with this concurrency issue any way.  So 
maybe something could get the lock by failing other operations, but those 
failed operation could still be running and mutating the metadata table and 
zookeeper after the lock is acquired.
   
   The first problem is easy to address by calling a diff constructor, but the 
2nd problem would require some new mechanism to initiate failure and wait for 
anything that may be running. Since all fate ops are run in the manager could 
have something that does the following.
   
   - Prevents repos related to failed txids from starting to run (like if it 
was sitting a thread pool queue)
   - Waits until repos related to failed txids that are currently running 
complete. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to