[ https://issues.apache.org/jira/browse/JCR-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212092#comment-14212092 ]

Benjamin Papez commented on JCR-3738:
-------------------------------------

We are also seeing these locks frequently when a clustered server is under 
load. We are, however, running the tests with a custom Jackrabbit version 
(v2.4.5 plus several backports from 2.6.x) and on MySQL, but I think the 
problem is still the same and remains unsolved in the latest Jackrabbit.

To get better information I have also patched the Jackrabbit sources to add 
more logging and to trigger automatic thread dumps (therefore the line numbers 
in the stack traces do not match the sources in the source repository). The key 
to the problem is that sometimes the LOCAL_REVISION table update is done inside 
a transaction (most of the time it is done with autocommit=true). So I added an 
automatic thread dump for when the LOCAL_REVISION update is about to be done 
under a transaction, and another one for when we run into the timeout exception 
from the database.
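
For context, the instrumentation boils down to something like the sketch below 
(illustrative only, not the exact patch; the class and method names are made 
up): right before the LOCAL_REVISION update, check Connection.getAutoCommit(), 
and when it returns false, log a full thread dump via 
Thread.getAllStackTraces().

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.Map;

    // Sketch of the added instrumentation (illustrative names, not the real patch).
    final class RevisionUpdateDumper {

        // Called just before the LOCAL_REVISION update statement is executed.
        static void dumpIfInTransaction(Connection con) throws SQLException {
            if (con.getAutoCommit()) {
                return; // normal case: autocommit=true, nothing to report
            }
            StringBuilder sb =
                    new StringBuilder("LOCAL_REVISION update under TX, thread dump:\n");
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                sb.append(e.getKey().getName()).append('\n');
                for (StackTraceElement frame : e.getValue()) {
                    sb.append("    at ").append(frame).append('\n');
                }
            }
            System.err.println(sb); // the real patch writes to the Jackrabbit logger
        }
    }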

Please look at the attached extended-log-with-dumps.txt (I have excluded 
non-essential stack frames and log lines from it).

It looks like:
1.) Thread "scheduler_Worker-1" is in SharedItemStateManager$Update.end calling 
ClusterNode$WorkspaceUpdateChannel.updateCommitted. There, in 
AppendRecord.update(), the DatabaseJournal is unlocked: the batch is ended and 
the writeLock in AbstractJournal is released. The next method call, 
setRevision(record.getRevision()), does the update on LOCAL_REVISION with a DB 
connection having autocommit=true, but it is stuck, because

2.) Meanwhile another thread, "http-bio-8080-exec-244", doing a NodeImpl.unlock, 
gets the AbstractJournal writeLock, starts a new batch on the DatabaseJournal 
and, in doSynch, updates the LOCAL_REVISION table within a transaction. This is 
done inside SharedItemStateManager$Update.begin calling 
eventChannel.updateCreated(this). However, that thread cannot continue to the 
DB commit, because it then gets stuck in the next call, when acquiring the 
writeLock within DefaultISMLocking.acquireWriteLock, where it goes into wait() 
in:

            while (writerId != null
                    ? !isSameThreadId(writerId, currentId) : readerCount > 0) {
                wait();
            }

3.) Only after "scheduler_Worker-1" runs into the first timeout, retries the 
SQL command, and runs into another DB timeout that finally throws the 
exception, does thread "http-bio-8080-exec-244" start to continue. So I assume 
that "scheduler_Worker-1" was holding the ISM lock that "http-bio-8080-exec-244" 
was waiting for, thus creating the deadlock (a minimal sketch of the cycle 
follows below).
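
To make the cycle explicit, here is a minimal, self-contained model of the 
interleaving (plain java.util.concurrent locks standing in for the ISM lock and 
for the row lock held by the open LOCAL_REVISION transaction; illustrative 
only, not Jackrabbit code):

    import java.util.concurrent.locks.ReentrantLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // "ism" models DefaultISMLocking, "revisionRow" models the DB row lock
    // held by the open transaction on LOCAL_REVISION.
    public class DeadlockSketch {
        static final ReentrantReadWriteLock ism = new ReentrantReadWriteLock();
        static final ReentrantLock revisionRow = new ReentrantLock();

        public static void main(String[] args) {
            Thread worker = new Thread(() -> {       // "scheduler_Worker-1"
                ism.readLock().lock();               // still holds the (downgraded) ISM lock
                pause();
                revisionRow.lock();                  // blocks: row locked by the TX below
            }, "scheduler_Worker-1");

            Thread http = new Thread(() -> {         // "http-bio-8080-exec-244"
                revisionRow.lock();                  // LOCAL_REVISION updated, TX still open
                pause();
                ism.writeLock().lock();              // blocks: the reader above never leaves
            }, "http-bio-8080-exec-244");

            worker.start();
            http.start();                            // neither thread ever finishes
        }

        static void pause() {
            try { Thread.sleep(200); } catch (InterruptedException ignored) { }
        }
    }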

This deadlock happens despite the earlier call to downgrade the writeLock in 
SharedItemStateManager$Update.end:
            try {
                // downgrade to read lock
                readLock = writeLock.downgrade();
                ...
            } finally {
                ...
                eventChannel.updateCommitted(this, path);

What I do not understand is why, in DefaultISMLocking.releaseWriteLock(boolean 
downgrade), the writerId is only set to null if readerCount == 0. Because of 
that, the other thread is stuck in the wait when acquiring the write lock:

            while (writerId != null
                    ? !isSameThreadId(writerId, currentId) : readerCount > 0) {
                wait();
            }
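
For illustration, here is a minimal model of the behaviour I read out of that 
code (my own paraphrase, not a copy of the Jackrabbit sources): on a downgrade 
the releasing thread stays counted as a reader, so readerCount > 0, writerId 
is not cleared, and every other thread in acquireWriteLock() keeps looping in 
the quoted wait().

    // Illustrative model only, NOT the Jackrabbit sources.
    class IsmLockModel {
        private Object writerId;   // id of the thread/tx holding the write lock
        private int readerCount;

        private synchronized void releaseWriteLock(boolean downgrade) {
            if (downgrade) {
                readerCount++;     // we keep holding a read lock
            }
            if (readerCount == 0) {
                writerId = null;   // never reached on a downgrade ...
            }
            notifyAll();           // ... so waiters wake up, re-check the quoted
        }                          // condition and go straight back to wait()
    }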

> CLONE - Deadlock on LOCAL_REVISION table in clustering environment
> ------------------------------------------------------------------
>
>                 Key: JCR-3738
>                 URL: https://issues.apache.org/jira/browse/JCR-3738
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>          Components: clustering
>    Affects Versions: 2.6.2
>         Environment: CQ5.6.1 with jackrabbit-core 2.6.2 backed by IBM DB2 v10.5
>            Reporter: Ankush Malhotra
>            Priority: Critical
>         Attachments: before-lock.zip, db-deadlock-info.txt, 
> extended-log-with-dumps.txt, stat-cache.log, threaddumps.zip
>
>
> Original, cloned description:
> > When inserting a lot of nodes concurrently (100/200 threads) the system 
> > hangs generating a deadlock on the LOCAL_REVISION table.
> > There is a thread that starts a transaction but the transaction remains 
> > open, while another thread tries to acquire the lock on the table.
> > This actually happens even if only one server is up, but it is configured 
> > in cluster mode.
> > I found that in AbstractJournal we try to write the LOCAL_REVISION even if 
> > we don't sync any records, because they were generated by the same journal 
> > as the running thread.
> >
> > Removing this unnecessary (to me :-) ) write to the LOCAL_REVISION table 
> > removes the deadlock.
> This might not be exactly the same case as this issue. See the attached 
> thread dumps etc. for full details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
