[ 
https://issues.apache.org/jira/browse/DERBY-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dag H. Wanvik updated DERBY-4920:
---------------------------------

    Attachment: derby-4943-a.stat
                derby-4920-a.diff

I was finally able to track this one down by instrumenting RAFContainer4 to 
collect a bit of history, cf. uploaded patch "RAFContainer4-instrumentation". 
(Thanks to Knut for noticing it happens more often when running on an in-memory 
file system e.g. /tmp on Solaris). Turned out the timing was slightly different 
there: instead of happening very infrequently, it happened more often than not 
when running the test that we saw hang: LobSortTest.

(Aside: trying to reproduce on Solaris made it even harder since the interrupt 
would sometimes get swallowed by non-NIO IO calls: 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6871306)

In any case, this is what happens: The test shuts down the database to modify 
static properties ("SystemPropertyTestSetup") after having created the table 
MIXED_LOBS in CleanDatabaseTestSetup#decorateSQL. The raw store demon thread is 
performing a check-point at the time the user thread shuts down Derby.  The 
shutdown interrupts the demon while it is doing NIO. The demon tries to recover 
the channel. This fails due a ShutdownException thrown when 
InterruptStatus.setInterrupted tries to find the lcc by calling 
ContextService#getContextOrNull. (RAFContainer4#recoverContainerAfterInterrupt 
calls noteAndClearInterrupt. The latter in turn calls setInterrupted.)

Later during shutdown, the user thread detects that there are still dirty pages 
and tries (again) to perform the check-point, but now the channel is closed, 
since the demon's recovery failed. This leads to the hang because the user 
thread keeps waiting for recovery to happen.

The fix is simple, just make ContextService#getContextOrNull ignore 
ShutdownException (as we did for DERBY-4911).

Note that this conforms my hypothesis that the check-pointing at shutdown has 
been faulty since we introduced NIO containers in 10.2: The interrupted state 
of the channel was masked by the shutdown.  It would only ever be an issue if 
there was a background thread doing check-pointing at shutdown time, however. 
The next reboot would read the logs and recover, of course, so that's probably 
why we never have had a report on this.


Patch details:

* InterruptStatus.java

Make sure we always catch ShutdownException when we access the context looking 
for lcc.

* RAFContainer4.java

Improved some comments, added a final, added a max count for a retry loop that 
missed it: will throw FILE_IO_INTERRUPTED if it exceeds MAX_INTERRUPT_RETRIES 
(as elsewhere when we give up).

Running regressions again.


> suites.All stuck in RAFContainer4.awaitRestoreChannel()
> -------------------------------------------------------
>
>                 Key: DERBY-4920
>                 URL: https://issues.apache.org/jira/browse/DERBY-4920
>             Project: Derby
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 10.8.0.0
>            Reporter: Knut Anders Hatlen
>            Assignee: Dag H. Wanvik
>         Attachments: derby-4920-a.diff, DERBY-4920_db2compatibility.txt, 
> derby-4943-a.stat, jstack.19071.txt
>
>
> I have suites.All running on trunk, and it seems to be stuck in a call to 
> RAFContainer4.awaitRestoreChannel(). Here's the stack for the waiting thread:
> "main" prio=3 tid=0x08070800 nid=0x2 waiting on condition [0xfe61e000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at 
> org.apache.derby.impl.store.raw.data.RAFContainer4.awaitRestoreChannel(RAFContainer4.java:698)
>         at 
> org.apache.derby.impl.store.raw.data.RAFContainer4.writePage(RAFContainer4.java:593)
>         at 
> org.apache.derby.impl.store.raw.data.CachedPage.writePage(CachedPage.java:787)
>         at 
> org.apache.derby.impl.store.raw.data.CachedPage.clean(CachedPage.java:610)
>         at 
> org.apache.derby.impl.services.cache.ConcurrentCache.cleanAndUnkeepEntry(ConcurrentCache.java:551)
>         at 
> org.apache.derby.impl.services.cache.ConcurrentCache.cleanCache(ConcurrentCache.java:509)
>         at 
> org.apache.derby.impl.services.cache.ConcurrentCache.cleanAll(ConcurrentCache.java:460)
>         at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.checkpoint(BaseDataFileFactory.java:1211)
>         at 
> org.apache.derby.impl.store.raw.log.LogToFile.checkpointWithTran(LogToFile.java:1710)
>         at 
> org.apache.derby.impl.store.raw.log.LogToFile.checkpoint(LogToFile.java:1507)
>         at org.apache.derby.impl.store.raw.RawStore.stop(RawStore.java:368)
>         at 
> org.apache.derby.impl.services.monitor.TopService.stop(TopService.java:442)
>         at 
> org.apache.derby.impl.services.monitor.TopService.shutdown(TopService.java:393)
>         at 
> org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(BaseMonitor.java:229)
>         at 
> org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(BaseMonitor.java:199)
>         at 
> org.apache.derby.jdbc.InternalDriver.connect(InternalDriver.java:231)
>         at 
> org.apache.derby.jdbc.AutoloadedDriver.connect(AutoloadedDriver.java:119)
>         at java.sql.DriverManager.getConnection(DriverManager.java:582)
>         at java.sql.DriverManager.getConnection(DriverManager.java:154)
>         at 
> org.apache.derbyTesting.junit.DriverManagerConnector.getConnectionByAttributes(DriverManagerConnector.java:137)
>         at 
> org.apache.derbyTesting.junit.DriverManagerConnector.shutEngine(DriverManagerConnector.java:120)
>         at 
> org.apache.derbyTesting.junit.TestConfiguration.shutdownEngine(TestConfiguration.java:1599)
>         at 
> org.apache.derbyTesting.junit.SystemPropertyTestSetup.setUp(SystemPropertyTestSetup.java:83)
>         at junit.extensions.TestSetup$1.protect(TestSetup.java:20)
>         at junit.framework.TestResult.runProtected(TestResult.java:124)
>         at junit.extensions.TestSetup.run(TestSetup.java:25)
>         at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
>         at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
>         at junit.framework.TestResult.runProtected(TestResult.java:124)
>         at junit.extensions.TestSetup.run(TestSetup.java:25)
>         at 
> org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
>         at junit.framework.TestSuite.runTest(TestSuite.java:230)
>         at junit.framework.TestSuite.run(TestSuite.java:225)
>         at junit.framework.TestSuite.runTest(TestSuite.java:230)
>         at junit.framework.TestSuite.run(TestSuite.java:225)
>         at junit.framework.TestSuite.runTest(TestSuite.java:230)
>         at junit.framework.TestSuite.run(TestSuite.java:225)
>         at junit.textui.TestRunner.doRun(TestRunner.java:121)
>         at junit.textui.TestRunner.start(TestRunner.java:185)
>         at junit.textui.TestRunner.main(TestRunner.java:143)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to