The fact that you only see this on one job is fairly clear evidence that we are seeing a hang of some kind, due to something a specific connector or connection is doing.

I'm going to have to guess wildly here to focus us on a productive path. What I want to rule out is a case where the connector hangs while establishing a connection. If that can happen, then I could well believe there would be a train wreck. Is this something you can confirm or disprove?
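As for where to hang log entries: I would bracket whatever call actually establishes the connection in that connector. A sketch only (getSession() here is a placeholder for the connector's real session-establishment call, and it assumes org.apache.manifoldcf.crawler.system.Logging is imported):

  long start = System.currentTimeMillis();
  Logging.connectors.warn("Session setup starting");
  getSession();  // placeholder for the real session-establishment call
  Logging.connectors.warn("Session setup done in "
    + (System.currentTimeMillis() - start) + " ms");

A "starting" line with no matching "done" line right before the train wreck would confirm the hang.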
Karl

On Thu, Dec 9, 2021 at 9:07 AM Julien Massiera <julien.massi...@francelabs.com> wrote:

> Actually, I have several jobs, but only one job is running at a time, and currently the error always happens on the same one. The problem is that I can't access the environment in debug mode, and I also can't activate debug logging because I am limited in log size, so the only thing I can do is add specific log entries at specific places in the code to try to understand what is happening. Where would you suggest I add log entries to maximise our chances of spotting the issue?
>
> Julien
>
> On 09/12/2021 at 13:27, Karl Wright wrote:
> > The large number of connections can happen, but usually that means something is stuck somewhere and there is a "train wreck" of other locks getting backed up.
> >
> > If this is completely repeatable, then I think we have an opportunity to figure out why it is happening. One thing that is clear is that this doesn't happen in other situations or in our integration tests, so that makes it necessary to ask what you may be doing differently here.
> >
> > I was operating on the assumption that the session just expires from lack of use, but in this case it may well be the other way around: something hangs elsewhere and a lock is held open for a very long time, long enough to exceed the timeout. If you have dozens of jobs running it might be a challenge to do this, but if you can winnow it down to a small number, the logs may give us a good picture of what is happening.
> >
> > Karl
> >
> > On Wed, Dec 8, 2021 at 3:55 PM Julien Massiera <julien.massi...@francelabs.com> wrote:
> >
> >> Hi,
> >>
> >> after increasing the session lifetime by a factor of 3, the lock error still happens and the MCF agent hangs, so all my jobs hang as well.
> >>
> >> Also, as I said in the other thread today, I notice a very large number of simultaneous connections from the agent to Zookeeper (more than 1000), and I cannot tell whether that is normal or not.
> >>
> >> Can we ignore that particular error and avoid blocking an entire MCF node?
> >>
> >> Julien
> >>
> >> On 07/12/2021 at 22:15, Julien Massiera wrote:
> >>> OK, that makes sense. But I still don't understand how the "Can't release lock we don't hold" exception can happen, knowing for sure that neither the Zookeeper process nor the MCF agent process has been down and/or restarted. I am not sure that increasing the session lifetime would solve that particular issue, and since I have no use case that easily reproduces it, it is very complicated to debug.
> >>>
> >>> Julien
> >>>
> >>> On 07/12/2021 at 19:08, Karl Wright wrote:
> >>>> What this code is doing is interpreting exceptions back from Zookeeper. There are some kinds of exceptions it interprets as "session has expired", so it rebuilds the session.
> >>>>
> >>>> The code is written in such a way that the locks are presumed to persist beyond the session. In fact, if they do not persist beyond the session, there is a risk that proper locks won't be enforced.
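> >>>>
> >>>> From memory, the relevant semantics are these: whether a node outlives its session depends entirely on its CreateMode. Here is an untested sketch against the raw ZooKeeper client (not MCF code) showing the distinction:
> >>>>
> >>>>   import java.util.concurrent.CountDownLatch;
> >>>>   import org.apache.zookeeper.CreateMode;
> >>>>   import org.apache.zookeeper.Watcher;
> >>>>   import org.apache.zookeeper.ZooDefs;
> >>>>   import org.apache.zookeeper.ZooKeeper;
> >>>>
> >>>>   public class EphemeralDemo {
> >>>>     // Open a session and wait until the client is actually connected.
> >>>>     static ZooKeeper connect() throws Exception {
> >>>>       CountDownLatch up = new CountDownLatch(1);
> >>>>       ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
> >>>>         if (event.getState() == Watcher.Event.KeeperState.SyncConnected)
> >>>>           up.countDown();
> >>>>       });
> >>>>       up.await();
> >>>>       return zk;
> >>>>     }
> >>>>
> >>>>     public static void main(String[] args) throws Exception {
> >>>>       ZooKeeper zk = connect();
> >>>>       // An ephemeral node is owned by the creating session and is deleted
> >>>>       // by the server as soon as that session closes or expires:
> >>>>       zk.create("/demo-lock", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
> >>>>           CreateMode.EPHEMERAL);
> >>>>       // A persistent node survives the session (delete it to rerun the demo):
> >>>>       zk.create("/demo-marker", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
> >>>>           CreateMode.PERSISTENT);
> >>>>       zk.close();
> >>>>
> >>>>       ZooKeeper zk2 = connect();
> >>>>       System.out.println(zk2.exists("/demo-lock", false));   // null: gone
> >>>>       System.out.println(zk2.exists("/demo-marker", false)); // a Stat: still there
> >>>>       zk2.close();
> >>>>     }
> >>>>   }
> >>>>
> >>>> If our lock nodes are ephemeral, they cannot outlive the session that created them, and the real question becomes what our own bookkeeping does when the session gets rebuilt.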
> >>>>
> >>>> If I recall correctly, we have a number of integration tests that exercise Zookeeper integration and that are meant to allow sessions to expire and be re-established. If what you say is true and information is attached solely to a session, Zookeeper could not possibly work as the cross-process lock mechanism we use it for. And yet it is used not just by us in this way, but by many other projects as well.
> >>>>
> >>>> So I think the diagnosis that nodes in Zookeeper have session affinity is not absolutely correct. It may be the case that only one session *owns* a node, and if that session expires then the node goes away. In that case I think the right approach is to modify the Zookeeper parameters to increase the session lifetime; I don't see any other way to prevent bad things from happening. Presumably, if a session is created within a process and the process dies, the session dies too.
> >>>>
> >>>> Karl
> >>>>
> >>>> On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera <julien.massi...@francelabs.com> wrote:
> >>>>
> >>>>> Karl,
> >>>>>
> >>>>> I tried to understand the Zookeeper lock logic in the code, and the only thing I don't understand is the 'handleEphemeralNodeKeeperException' method that is called in the catch(KeeperException e) block of every obtain/release lock method of the ZookeeperConnection class.
> >>>>>
> >>>>> This method sets the lockNode variable to 'null', recreates a session, and recreates nodes, but does not reset the lockNode variable at the end. So, as I understand it, this may result in the lock release error that I mentioned, because that error is triggered when the lockNode variable is 'null'.
> >>>>>
> >>>>> The method is in the class org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If you can take a look and tell me what you think about it, it would be great!
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Julien
> >>>>>
> >>>>> On 07/12/2021 at 14:40, Julien Massiera wrote:
> >>>>>> Yes, I will then try the patch and see if it is working.
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Julien
> >>>>>>
> >>>>>> On 07/12/2021 at 14:28, Karl Wright wrote:
> >>>>>>> Yes, this is plausible. But I'm not sure what the solution is. If a zookeeper session disappears, according to the documentation everything associated with that session should also disappear.
> >>>>>>>
> >>>>>>> So I guess we could catch this error and just ignore it, assuming that the session must be gone anyway?
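> >>>>>>>
> >>>>>>> At the call site that is blowing up, that might look roughly like this (a sketch only; I haven't verified that Logging.lock is the right logger there):
> >>>>>>>
> >>>>>>>   try {
> >>>>>>>     lockManager.leaveWriteLock(lockKey);
> >>>>>>>   } catch (IllegalStateException e) {
> >>>>>>>     // The ephemeral lock node was created under a session that has since
> >>>>>>>     // expired; the server has already deleted it, so there is nothing
> >>>>>>>     // left to release.  Log it and keep going instead of letting the
> >>>>>>>     // idle cleanup thread die.
> >>>>>>>     Logging.lock.warn("Ignoring lock release from an expired session", e);
> >>>>>>>   }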
> >>>>>>>
> >>>>>>> Karl
> >>>>>>>
> >>>>>>> On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <julien.massi...@francelabs.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> the Zookeeper lock error mentioned in the second-to-last comment of the issue https://issues.apache.org/jira/browse/CONNECTORS-1447:
> >>>>>>>>
> >>>>>>>> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed: Can't release lock we don't hold
> >>>>>>>> java.lang.IllegalStateException: Can't release lock we don't hold
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> >>>>>>>>   at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> >>>>>>>>   at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
> >>>>>>>>
> >>>>>>>> is still happening in 2021 with the 2.20 version of MCF.
> >>>>>>>>
> >>>>>>>> Karl, you hypothesized that it could be related to Zookeeper being restarted while the MCF agent is still running, but after some investigation my theory is that it is related to re-established sessions. Locks are associated not with a process but with a session, and it can happen that when a session is closed accidentally (interrupted by exceptions, etc.), it does not correctly release the locks it holds. When a new session is created by Zookeeper for the same client, the locks cannot be released because they belong to an old session, and the exception is thrown!
> >>>>>>>>
> >>>>>>>> Does this sound plausible to you? I have no knowledge of Zookeeper, but if it is plausible, then it is worth investigating the code to see whether everything is correctly done to ensure that all locks are released when a session is closed or interrupted by a problem.
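> >>>>>>>>
> >>>>>>>> If it helps, here is a minimal stand-alone model of the sequence I suspect (my own names and simplifications, not the actual MCF code):
> >>>>>>>>
> >>>>>>>>   // No ZooKeeper needed: just the state bug, reduced to its skeleton.
> >>>>>>>>   public class LockNodeBug {
> >>>>>>>>     private String lockNode; // path of the ephemeral node we created
> >>>>>>>>
> >>>>>>>>     void obtainLock() {
> >>>>>>>>       lockNode = "/org.apache.manifoldcf/locks/write-0001"; // made-up path
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>>     void handleSessionExpired() {
> >>>>>>>>       // Models what handleEphemeralNodeKeeperException appears to do:
> >>>>>>>>       // the session is rebuilt, but lockNode is cleared and never restored.
> >>>>>>>>       lockNode = null;
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>>     void releaseLock() {
> >>>>>>>>       if (lockNode == null)
> >>>>>>>>         throw new IllegalStateException("Can't release lock we don't hold");
> >>>>>>>>       lockNode = null;
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>>     public static void main(String[] args) {
> >>>>>>>>       LockNodeBug c = new LockNodeBug();
> >>>>>>>>       c.obtainLock();
> >>>>>>>>       c.handleSessionExpired(); // a session expiry between obtain and release...
> >>>>>>>>       c.releaseLock();          // ...throws exactly the error in the FATAL above
> >>>>>>>>     }
> >>>>>>>>   }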
> >>>>>>>>
> >>>>>>>> Julien