The fact that you only see this on one job is pretty clear evidence that
we are seeing a hang of some kind due to something a specific connector or
connection is doing.

I'm going to have to guess wildly here to focus us on a productive path.
What I want to rule out is a case where the connector hangs while
establishing a connection.  If this can happen then I could well believe
there would be a train wreck.  Is this something you can confirm or
disprove?
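
To make that concrete, here is a rough sketch of the kind of timing log I
would add around the connector's connection setup. The ConnectTimer class and
the connectCall argument are placeholders for whatever your connector actually
does when it opens its connection, and I'm assuming the crawler's
Logging.connectors logger is visible where you put it:

  import org.apache.manifoldcf.crawler.system.Logging;

  // Hypothetical helper: wrap a connector's connection setup so a hang or a
  // slow connect shows up in the log with a clear elapsed time.
  public class ConnectTimer
  {
    public static void timedConnect(String connectionName, Runnable connectCall)
    {
      long start = System.currentTimeMillis();
      Logging.connectors.info("Opening connection for '" + connectionName + "'");
      try
      {
        connectCall.run();
        Logging.connectors.info("Opened connection for '" + connectionName +
          "' in " + (System.currentTimeMillis() - start) + " ms");
      }
      catch (RuntimeException e)
      {
        Logging.connectors.error("Connection setup for '" + connectionName +
          "' failed after " + (System.currentTimeMillis() - start) + " ms", e);
        throw e;
      }
    }
  }

If the job wedges right after an "Opening connection" line that never gets a
matching "Opened connection" line, that would point squarely at a hang during
connection establishment.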

Karl


On Thu, Dec 9, 2021 at 9:07 AM Julien Massiera <
julien.massi...@francelabs.com> wrote:

> Actually, I have several jobs, but only one job is running at a time,
> and currently the error always happens on the same one. The problem is
> that I can't access the environment in debug mode, and I also can't
> activate debug logging because I am limited in log size, so the only
> thing I can do is add specific log statements in specific places in the
> code to try to understand what is happening. Where would you suggest I
> add log entries to optimise our chances of spotting the issue?
>
> Julien
>
> On 09/12/2021 at 13:27, Karl Wright wrote:
> > The large number of connections can happen, but usually that means
> > something is stuck somewhere and there is a "train wreck" of other locks
> > getting backed up.
> >
> > If this is completely repeatable then I think we have an opportunity to
> > figure out why this is happening.  One thing that is clear is that this
> > doesn't happen in other situations or in our integration tests, so that
> > makes it necessary to ask what you may be doing differently here?
> >
> > I was operating on the assumption that the session just expires from lack
> > of use, but in this case it may well be the other way around: something
> > hangs elsewhere and a lock is held open for a very long time, long enough
> > to exceed the timeout.  If you have dozens of jobs running it might be a
> > challenge to do this but if you can winnow it down to a small number the
> > logs may give us a good picture of what is happening.
> >
> > Karl
> >
> >
> >
> >
> > On Wed, Dec 8, 2021 at 3:55 PM Julien Massiera <
> > julien.massi...@francelabs.com> wrote:
> >
> >> Hi,
> >>
> >> After having increased the session lifetime by a factor of 3, the lock
> >> error still happens and the MCF agent hangs, so all my jobs hang as well.
> >>
> >> Also, as I said in the other thread today, I notice a very large number
> >> of simultaneous connections from the agent to Zookeeper (more than 1000),
> >> and I cannot tell whether that is normal or not.
> >>
> >> Can we ignore that particular error and avoid blocking an entire MCF
> >> node?
> >>
> >> Julien
> >>
> >> On 07/12/2021 at 22:15, Julien Massiera wrote:
> >>> OK, that makes sense. But still, I don't understand how the "Can't
> >>> release lock we don't hold" exception can happen, knowing for sure
> >>> that neither the Zookeeper process nor the MCF agent process has been
> >>> down and/or restarted. I'm not sure that increasing the session
> >>> lifetime would solve that particular issue, and since I have no use
> >>> case to easily reproduce it, it is very complicated to debug.
> >>>
> >>> Julien
> >>>
> >>> On 07/12/2021 at 19:08, Karl Wright wrote:
> >>>> What this code is doing is interpreting exceptions back from Zookeeper.
> >>>> There are some kinds of exceptions it interprets as "session has
> >>>> expired", so it rebuilds the session.
> >>>>
> >>>> The code is written in such a way that the locks are presumed to persist
> >>>> beyond the session.  In fact, if they do not persist beyond the session,
> >>>> there is a risk that proper locks won't be enforced.
> >>>>
> >>>> If I recall correctly, we have a number of integration tests that
> >>>> exercise Zookeeper integration that are meant to allow sessions to
> >>>> expire and be re-established.  If what you say is true and information
> >>>> is attached solely to a session, Zookeeper cannot possibly work as the
> >>>> cross-process lock mechanism we use it for.  And yet it is used not just
> >>>> by us in this way, but by many other projects as well.
> >>>>
> >>>> So I think that the diagnosis that nodes in Zookeeper have session
> >>>> affinity is not absolutely correct. It may be the case that only one
> >>>> session *owns* a node, and if that session expires then the node goes
> >>>> away.  In that case I think the right approach is to modify the
> >>>> zookeeper parameters to increase the session lifetime; I don't see any
> >>>> other way to prevent bad things from happening.  Presumably, if a
> >>>> session is created within a process, and the process dies, the session
> >>>> does too.
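> >>>>
> >>>> As a sketch of which parameters I mean (the zoo.cfg names below are
> >>>> standard ZooKeeper server settings; the properties.xml name is from
> >>>> memory, so please verify it against your own configuration): the
> >>>> client-requested session timeout is silently capped by the server's
> >>>> maxSessionTimeout, so raising only the client side may have no effect.
> >>>>
> >>>>   # zoo.cfg (ZooKeeper server side)
> >>>>   tickTime=2000
> >>>>   # default maximum is 20 * tickTime; raise it so a longer client
> >>>>   # session timeout is actually honored
> >>>>   maxSessionTimeout=120000
> >>>>
> >>>>   <!-- properties.xml (MCF side) - requested session timeout in ms -->
> >>>>   <property name="org.apache.manifoldcf.synchronization.zookeeper.sessiontimeout" value="120000"/>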
> >>>>
> >>>> Karl
> >>>>
> >>>>
> >>>> On Tue, Dec 7, 2021 at 11:54 AM Julien Massiera <
> >>>> julien.massi...@francelabs.com> wrote:
> >>>>
> >>>>> Karl,
> >>>>>
> >>>>> I tried to understand the Zookeeper lock logic in the code, and the
> >>>>> only thing I don't understand is the 'handleEphemeralNodeKeeperException'
> >>>>> method that is called in the catch(KeeperException e) block of every
> >>>>> obtain/release lock method of the ZookeeperConnection class.
> >>>>>
> >>>>> This method sets the lockNode param to 'null', recreates a session and
> >>>>> recreates nodes, but does not reset the lockNode param at the end. So,
> >>>>> as I understand it, when this happens it may result in the lock release
> >>>>> error that I mentioned, because that error is triggered when the
> >>>>> lockNode param is 'null'.
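> >>>>>
> >>>>> To make the suspected sequence concrete, here is a tiny self-contained
> >>>>> illustration of the pattern I think I am seeing. This is not the real
> >>>>> MCF code, just the shape of it, with invented names:
> >>>>>
> >>>>>   // Illustration only: an exception handler that rebuilds the session
> >>>>>   // but forgets to restore lockNode, so the next release blows up.
> >>>>>   public class LockNodeIllustration
> >>>>>   {
> >>>>>     private String lockNode = "/locks/write-0000000001";
> >>>>>
> >>>>>     void handleKeeperException()
> >>>>>     {
> >>>>>       lockNode = null;   // the handler drops the node reference,
> >>>>>                          // recreates session and base nodes elsewhere,
> >>>>>                          // but never resets lockNode
> >>>>>     }
> >>>>>
> >>>>>     void releaseLock()
> >>>>>     {
> >>>>>       if (lockNode == null)
> >>>>>         throw new IllegalStateException("Can't release lock we don't hold");
> >>>>>       lockNode = null;   // normal path: delete the node, clear the reference
> >>>>>     }
> >>>>>
> >>>>>     public static void main(String[] args)
> >>>>>     {
> >>>>>       LockNodeIllustration c = new LockNodeIllustration();
> >>>>>       c.handleKeeperException();  // a session hiccup while a lock is held
> >>>>>       c.releaseLock();            // -> IllegalStateException
> >>>>>     }
> >>>>>   }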
> >>>>>
> >>>>> The method is in the class
> >>>>> org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection. If you can
> >>>>> take a look and tell me what you think about it, it would be great!
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Julien
> >>>>>
> >>>>> On 07/12/2021 at 14:40, Julien Massiera wrote:
> >>>>>> Yes, I will then try the patch and see if it is working.
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> Julien
> >>>>>>
> >>>>>> On 07/12/2021 at 14:28, Karl Wright wrote:
> >>>>>>> Yes, this is plausible.  But I'm not sure what the solution is.  If a
> >>>>>>> zookeeper session disappears, according to the documentation everything
> >>>>>>> associated with that session should also disappear.
> >>>>>>>
> >>>>>>> So I guess we could catch this error and just ignore it, assuming
> >>>>>>> that the
> >>>>>>> session must be gone anyway?
> >>>>>>>
> >>>>>>> Karl
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Dec 7, 2021 at 8:21 AM Julien Massiera <
> >>>>>>> julien.massi...@francelabs.com> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> the Zookeeper lock error mentioned in the second-to-last comment of
> >>>>>>>> this issue, https://issues.apache.org/jira/browse/CONNECTORS-1447:
> >>>>>>>>
> >>>>>>>> FATAL 2017-08-04 09:28:25,855 (Agents idle cleanup thread) - Error tossed:
> >>>>>>>> Can't release lock we don't hold
> >>>>>>>> java.lang.IllegalStateException: Can't release lock we don't hold
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperConnection.releaseLock(ZooKeeperConnection.java:815)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearLock(ZooKeeperLockObject.java:218)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.ZooKeeperLockObject.clearGlobalWriteLockNoWait(ZooKeeperLockObject.java:100)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.clearGlobalWriteLock(LockObject.java:160)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockObject.leaveWriteLock(LockObject.java:141)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.LockGate.leaveWriteLock(LockGate.java:205)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWrite(BaseLockManager.java:1224)
> >>>>>>>>   at org.apache.manifoldcf.core.lockmanager.BaseLockManager.leaveWriteLock(BaseLockManager.java:771)
> >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool$Pool.pollAll(ConnectorPool.java:670)
> >>>>>>>>   at org.apache.manifoldcf.core.connectorpool.ConnectorPool.pollAllConnectors(ConnectorPool.java:338)
> >>>>>>>>   at org.apache.manifoldcf.agents.transformationconnectorpool.TransformationConnectorPool.pollAllConnectors(TransformationConnectorPool.java:121)
> >>>>>>>>   at org.apache.manifoldcf.agents.system.IdleCleanupThread.run(IdleCleanupThread.java:91)
> >>>>>>>>
> >>>>>>>> is still happening in 2021 with the 2.20 version of MCF.
> >>>>>>>>
> >>>>>>>> Karl, you hypothesized that it could be related to Zookeeper being
> >>>>>>>> restarted while the MCF agent is still running, but after some
> >>>>>>>> investigation, my theory is that it is related to re-established
> >>>>>>>> sessions. Locks are not associated with a process but with a session,
> >>>>>>>> and it can happen that when a session is closed accidentally
> >>>>>>>> (interrupted by exceptions, etc.), it does not correctly release the
> >>>>>>>> locks it holds. When a new session is created by Zookeeper for the
> >>>>>>>> same client, the locks cannot be released because they belong to an
> >>>>>>>> old session, and the exception is thrown!
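> >>>>>>>>
> >>>>>>>> To illustrate the session/lock affinity I mean, here is a minimal
> >>>>>>>> sketch using the plain ZooKeeper client API. It assumes a throwaway
> >>>>>>>> ZooKeeper server on localhost:2181 and is only meant to show that an
> >>>>>>>> ephemeral lock node does not outlive the session that created it:
> >>>>>>>>
> >>>>>>>>   import java.util.concurrent.CountDownLatch;
> >>>>>>>>   import org.apache.zookeeper.*;
> >>>>>>>>
> >>>>>>>>   public class EphemeralLockDemo
> >>>>>>>>   {
> >>>>>>>>     static ZooKeeper connect() throws Exception
> >>>>>>>>     {
> >>>>>>>>       CountDownLatch ready = new CountDownLatch(1);
> >>>>>>>>       ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 10000, event -> {
> >>>>>>>>         if (event.getState() == Watcher.Event.KeeperState.SyncConnected)
> >>>>>>>>           ready.countDown();
> >>>>>>>>       });
> >>>>>>>>       ready.await();  // wait until the session is established
> >>>>>>>>       return zk;
> >>>>>>>>     }
> >>>>>>>>
> >>>>>>>>     public static void main(String[] args) throws Exception
> >>>>>>>>     {
> >>>>>>>>       ZooKeeper session1 = connect();
> >>>>>>>>       String lock = session1.create("/demo-lock", new byte[0],
> >>>>>>>>           ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
> >>>>>>>>       // Session 1 ends without an explicit unlock; the server removes
> >>>>>>>>       // the ephemeral node (expiry would have the same effect).
> >>>>>>>>       session1.close();
> >>>>>>>>       // A new session for the same client has nothing left to release.
> >>>>>>>>       ZooKeeper session2 = connect();
> >>>>>>>>       System.out.println("lock node still there? " +
> >>>>>>>>           (session2.exists(lock, false) != null));  // prints false
> >>>>>>>>       session2.close();
> >>>>>>>>     }
> >>>>>>>>   }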
> >>>>>>>>
> >>>>>>>> Does this seem plausible to you? I have no knowledge of Zookeeper,
> >>>>>>>> but if it is plausible, then it is worth investigating the code to
> >>>>>>>> make sure that everything is done correctly so that all locks are
> >>>>>>>> released when a session is closed/interrupted by a problem.
> >>>>>>>> Julien
> >>>>>>>>
