[
https://issues.apache.org/jira/browse/JCR-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212092#comment-14212092
]
Benjamin Papez commented on JCR-3738:
-------------------------------------
We are also seeing these locks frequently when a clustered server is under
load. We are, however, running the tests with a custom Jackrabbit version
(v2.4.5 plus several backports from 2.6.x) on MySQL, but I think the problem
is the same and still unsolved in the latest Jackrabbit.
To get better information I have also patched the Jackrabbit sources to add
more logging and to trigger automatic thread dumps (which is why the line
numbers in the stack traces don't match the sources in the source repository).
The key to the problem is that sometimes the LOCAL_REVISION table update is
done inside a transaction (most of the time it is done with autocommit=true).
So I added an automatic thread dump when the LOCAL_REVISION update is about to
be done inside a transaction, and another one when we run into the timeout
exception from the database. Please look at the attached
extended-log-with-dumps.txt (I have also stripped non-essential stack frames
and log lines from it).
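A helper like the following is enough for such a dump trigger; this is only a
simplified sketch (the class name is made up for illustration, it is not the
exact code from my patch):

    import java.util.Map;

    // Minimal programmatic thread-dump trigger (illustrative helper only,
    // not the actual patch applied to the Jackrabbit sources).
    public final class ThreadDumpTrigger {

        // Dumps the stack of every live thread to stderr, tagged with a reason.
        public static void dump(String reason) {
            StringBuilder sb =
                    new StringBuilder("=== thread dump: " + reason + " ===\n");
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                Thread t = e.getKey();
                sb.append('"').append(t.getName())
                  .append("\" state=").append(t.getState()).append('\n');
                for (StackTraceElement frame : e.getValue()) {
                    sb.append("    at ").append(frame).append('\n');
                }
            }
            System.err.println(sb);
        }
    }

The two trigger points described above (just before the transactional
LOCAL_REVISION update, and in the handler for the database timeout) would then
just call ThreadDumpTrigger.dump(...).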
It looks like:
1.) Thread "scheduler_Worker-1" is in SharedItemStateManager$Update.end calling
ClusterNode$WorkspaceUpdateChannel.updateCommitted . There in
AppendRecord.update() the DatabaseJournal is unlocked - batch is ended and
writeLock in AbstractJournal is released. The next method call:
setRevision(record.getRevision()); does the update on LOCAL_REVISION with a DB
connection having autocommit=true , however it is stuck, because
2.) meanwhile another thread "http-bio-8080-exec-244", doing a NodeImpl.unlock,
gets the AbstractJournal writeLock, starts a new batch on the DatabaseJournal
and in doSynch updates the LOCAL_REVISION table within a transaction. This is
done inside SharedItemStateManager$Update.begin calling
eventChannel.updateCreated(this). However, that thread cannot get as far as
the DB commit, because it then gets stuck in the next call, when acquiring the
writeLock within DefaultISMLocking.acquireWriteLock, where it goes into wait()
in:

    while (writerId != null
            ? !isSameThreadId(writerId, currentId) : readerCount > 0) {
        wait();
    }
3.) Only after "scheduler_Worker-1" runs into the first timeout, retries the
SQL command and runs into a second DB timeout that finally throws the
exception does thread "http-bio-8080-exec-244" start to continue. So I assume
that "scheduler_Worker-1" was holding the ISM lock that
"http-bio-8080-exec-244" was waiting for, thus creating the deadlock
(sketched below).
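To make the lock ordering easier to see, here is a stripped-down,
self-contained model of the two resources involved: one lock standing in for
the DefaultISMLocking write lock and a second one standing in for the DB row
lock on LOCAL_REVISION. It only illustrates the interleaving from 1.) to 3.)
above; it is not Jackrabbit code:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    // Simplified model of the deadlock described above (not Jackrabbit code).
    // ismLock stands in for the DefaultISMLocking write lock,
    // revisionRowLock stands in for the DB row lock on LOCAL_REVISION.
    public class LocalRevisionDeadlockModel {

        private static final ReentrantLock ismLock = new ReentrantLock();
        private static final ReentrantLock revisionRowLock = new ReentrantLock();

        public static void main(String[] args) throws InterruptedException {
            Thread scheduler = new Thread(() -> {
                // Holds the ISM lock in Update.end (in the real case a
                // downgraded, but still blocking, lock).
                ismLock.lock();
                try {
                    pause(100);    // let the other thread open its TX first
                    // setRevision(): blocks on the row locked by the open TX.
                    lockOrReport(revisionRowLock, "scheduler_Worker-1");
                } finally {
                    ismLock.unlock();
                }
            }, "scheduler_Worker-1");

            Thread http = new Thread(() -> {
                revisionRowLock.lock();   // LOCAL_REVISION update inside a TX
                try {
                    pause(100);
                    // Update.begin -> acquireWriteLock(): blocks on the ISM lock.
                    lockOrReport(ismLock, "http-bio-8080-exec-244");
                } finally {
                    // The release stands in for the TX commit, which in the
                    // real case only happens after the ISM lock is obtained.
                    revisionRowLock.unlock();
                }
            }, "http-bio-8080-exec-244");

            scheduler.start();
            http.start();
            scheduler.join();
            http.join();
        }

        private static void lockOrReport(ReentrantLock lock, String who) {
            try {
                // The timeout stands in for the database lock wait timeout.
                if (!lock.tryLock(2, TimeUnit.SECONDS)) {
                    System.out.println(who + " timed out waiting, like the SQL"
                            + " timeout in the attached log");
                } else {
                    lock.unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private static void pause(long ms) {
            try {
                Thread.sleep(ms);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

Run as-is, at least one thread reports a timeout after about two seconds
instead of making progress, mirroring the SQL timeout seen in the real system.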
This happens despite the previous call to downgrade the writeLock in
SharedItemStateManager$Update.end:

    try {
        // downgrade to read lock
        readLock = writeLock.downgrade();
        ...
    } finally {
        ...
        eventChannel.updateCommitted(this, path);
What I do not understand is why, in DefaultISMLocking.releaseWriteLock(boolean
downgrade), the writerId is only set to null if readerCount == 0. Because of
that, the other thread stays stuck in the wait when acquiring the write lock:

    while (writerId != null
            ? !isSameThreadId(writerId, currentId) : readerCount > 0) {
        wait();
    }
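Spelled out: as long as writerId stays non-null after the downgrade, every
other thread entering acquireWriteLock takes the first branch of that ternary
and keeps waiting. The following is only my reading of the logic as a
simplified model; it is not a copy of the DefaultISMLocking source:

    // Simplified model of the locking behaviour described above; the field
    // and method names follow the quoted snippets, the bodies are a
    // reconstruction, not the real implementation.
    public class IsmLockingModel {

        private Object writerId;     // owner of the write lock, null if none
        private int readerCount;     // number of active read locks

        public synchronized void acquireWriteLock(Object currentId)
                throws InterruptedException {
            // Wait while another owner holds the write lock, or, if nobody
            // does, while read locks are still active.
            while (writerId != null
                    ? !writerId.equals(currentId) : readerCount > 0) {
                wait();
            }
            writerId = currentId;
        }

        public synchronized void releaseWriteLock(boolean downgrade) {
            if (downgrade) {
                readerCount++;       // keep holding a read lock
            }
            if (readerCount == 0) {  // writerId is cleared only here, so after
                writerId = null;     // a downgrade other writers keep waiting
            }
            notifyAll();
        }

        public synchronized void releaseReadLock() {
            if (--readerCount == 0) {
                writerId = null;     // only now can a different writer get in
                notifyAll();
            }
        }
    }

With this reading, the behaviour in 3.) follows directly: the write lock only
becomes available to "http-bio-8080-exec-244" once "scheduler_Worker-1"
finally leaves Update.end and releases its read lock.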
> CLONE - Deadlock on LOCAL_REVISION table in clustering environment
> ------------------------------------------------------------------
>
> Key: JCR-3738
> URL: https://issues.apache.org/jira/browse/JCR-3738
> Project: Jackrabbit Content Repository
> Issue Type: Bug
> Components: clustering
> Affects Versions: 2.6.2
> Environment: CQ5.6.1 with jackrabbit-core 2.6.2 backed by IBM DB2
> v10.5
> Reporter: Ankush Malhotra
> Priority: Critical
> Attachments: before-lock.zip, db-deadlock-info.txt,
> extended-log-with-dumps.txt, stat-cache.log, threaddumps.zip
>
>
> Original, cloned description:
> > When inserting a lot of nodes concurrently (100/200 threads) the system
> > hangs, generating a deadlock on the LOCAL_REVISION table.
> > There is a thread that starts a transaction but the transaction remains
> > open, while another thread tries to acquire the lock on the table.
> > This actually happens even if there is only one server up, but configured
> > in cluster mode.
> > I found that in AbstractJournal we try to write the LOCAL_REVISION even if
> > we don't sync any records, because they were generated by the same journal
> > as the running thread.
> >
> > Removing this unnecessary (to me :-) ) write to the LOCAL_REVISION table
> > removes the deadlock.
> This might not be exactly the same case as this issue. See the attached
> thread dumps etc. for full details.