keith-turner commented on a change in pull request #2329:
URL: https://github.com/apache/accumulo/pull/2329#discussion_r740260185
##########
File path:
core/src/main/java/org/apache/accumulo/fate/zookeeper/DistributedReadWriteLock.java
##########
@@ -218,22 +237,54 @@ public boolean tryLock() {
}
SortedMap<Long,byte[]> entries = qlock.getEarlierEntries(entry);
Iterator<Entry<Long,byte[]>> iterator = entries.entrySet().iterator();
- if (!iterator.hasNext())
+ if (!iterator.hasNext()) {
throw new IllegalStateException("Did not find our own lock in the
queue: " + this.entry
+ " userData " + new String(this.userData, UTF_8) + " lockType " +
lockType());
- return iterator.next().getKey().equals(entry);
+ }
+ if (!failBlockers) {
+ return iterator.next().getKey().equals(entry);
+ } else {
+ ZooStore<DistributedReadWriteLock> zs;
+ try {
+ zs = new ZooStore<>(zooPath, zrw);
+ } catch (KeeperException | InterruptedException e1) {
+ log.error("Error creating zoo store", e1);
+ return false;
+ }
+ final AdminUtil<DistributedReadWriteLock> util = new AdminUtil<>();
+ boolean result = true;
+ while (iterator.hasNext()) {
+ Entry<Long,byte[]> e = iterator.next();
+ if (!e.getKey().equals(entry)) {
+ result &= util.prepFail(zs, zrw, zooManagerPath,
Long.toString(e.getKey(), 16));
Review comment:
I see two problems with the call here. First prepFail() calls
checkGlobalLock() which calls System.exit() when the constuctor AdminUtil<>()
is called. The reason it calles checkGlobalLock is because I think this code
was only intended to be called when the manager was not running and no repos
were running. The Second problem is that prepFail only initiates failure, it
does not wait for any operations that may currently be actively executing and
holding the lock to finish. This is why prepFail checks that the Manager is
not running, because it does not deal with this concurrency issue any way. So
maybe something could get the lock by failing other operations, but those
failed operation could still be running and mutating the metadata table and
zookeeper after the lock is acquired.
The first problem is easy to address by calling a diff constructor, but the
2nd problem would require some new mechanism to initiate failure and wait for
anything that may be running. Since all fate ops are run in the manager could
have something that does the following.
- Prevents repos related to failed txids from starting to run (like if it
was sitting a thread pool queue)
- Waits until repos related to failed txids that are currently running
complete.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]