[
https://issues.apache.org/jira/browse/DERBY-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dag H. Wanvik updated DERBY-4920:
---------------------------------
Attachment: derby-4943-a.stat
derby-4920-a.diff
I was finally able to track this one down by instrumenting RAFContainer4 to
collect a bit of history, cf. uploaded patch "RAFContainer4-instrumentation".
(Thanks to Knut for noticing it happens more often when running on an in-memory
file system e.g. /tmp on Solaris). Turned out the timing was slightly different
there: instead of happening very infrequently, it happened more often than not
when running the test that we saw hang: LobSortTest.
(Aside: trying to reproduce on Solaris made it even harder since the interrupt
would sometimes get swallowed by non-NIO IO calls:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6871306)
In any case, this is what happens: The test shuts down the database to modify
static properties ("SystemPropertyTestSetup") after having created the table
MIXED_LOBS in CleanDatabaseTestSetup#decorateSQL. The raw store demon thread is
performing a check-point at the time the user thread shuts down Derby. The
shutdown interrupts the demon while it is doing NIO. The demon tries to recover
the channel. This fails due a ShutdownException thrown when
InterruptStatus.setInterrupted tries to find the lcc by calling
ContextService#getContextOrNull. (RAFContainer4#recoverContainerAfterInterrupt
calls noteAndClearInterrupt. The latter in turn calls setInterrupted.)
Later during shutdown, the user thread detects that there are still dirty pages
and tries (again) to perform the check-point, but now the channel is closed,
since the demon's recovery failed. This leads to the hang because the user
thread keeps waiting for recovery to happen.
The fix is simple, just make ContextService#getContextOrNull ignore
ShutdownException (as we did for DERBY-4911).
Note that this conforms my hypothesis that the check-pointing at shutdown has
been faulty since we introduced NIO containers in 10.2: The interrupted state
of the channel was masked by the shutdown. It would only ever be an issue if
there was a background thread doing check-pointing at shutdown time, however.
The next reboot would read the logs and recover, of course, so that's probably
why we never have had a report on this.
Patch details:
* InterruptStatus.java
Make sure we always catch ShutdownException when we access the context looking
for lcc.
* RAFContainer4.java
Improved some comments, added a final, added a max count for a retry loop that
missed it: will throw FILE_IO_INTERRUPTED if it exceeds MAX_INTERRUPT_RETRIES
(as elsewhere when we give up).
Running regressions again.
> suites.All stuck in RAFContainer4.awaitRestoreChannel()
> -------------------------------------------------------
>
> Key: DERBY-4920
> URL: https://issues.apache.org/jira/browse/DERBY-4920
> Project: Derby
> Issue Type: Bug
> Components: Store
> Affects Versions: 10.8.0.0
> Reporter: Knut Anders Hatlen
> Assignee: Dag H. Wanvik
> Attachments: derby-4920-a.diff, DERBY-4920_db2compatibility.txt,
> derby-4943-a.stat, jstack.19071.txt
>
>
> I have suites.All running on trunk, and it seems to be stuck in a call to
> RAFContainer4.awaitRestoreChannel(). Here's the stack for the waiting thread:
> "main" prio=3 tid=0x08070800 nid=0x2 waiting on condition [0xfe61e000]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.derby.impl.store.raw.data.RAFContainer4.awaitRestoreChannel(RAFContainer4.java:698)
> at
> org.apache.derby.impl.store.raw.data.RAFContainer4.writePage(RAFContainer4.java:593)
> at
> org.apache.derby.impl.store.raw.data.CachedPage.writePage(CachedPage.java:787)
> at
> org.apache.derby.impl.store.raw.data.CachedPage.clean(CachedPage.java:610)
> at
> org.apache.derby.impl.services.cache.ConcurrentCache.cleanAndUnkeepEntry(ConcurrentCache.java:551)
> at
> org.apache.derby.impl.services.cache.ConcurrentCache.cleanCache(ConcurrentCache.java:509)
> at
> org.apache.derby.impl.services.cache.ConcurrentCache.cleanAll(ConcurrentCache.java:460)
> at
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.checkpoint(BaseDataFileFactory.java:1211)
> at
> org.apache.derby.impl.store.raw.log.LogToFile.checkpointWithTran(LogToFile.java:1710)
> at
> org.apache.derby.impl.store.raw.log.LogToFile.checkpoint(LogToFile.java:1507)
> at org.apache.derby.impl.store.raw.RawStore.stop(RawStore.java:368)
> at
> org.apache.derby.impl.services.monitor.TopService.stop(TopService.java:442)
> at
> org.apache.derby.impl.services.monitor.TopService.shutdown(TopService.java:393)
> at
> org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(BaseMonitor.java:229)
> at
> org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(BaseMonitor.java:199)
> at
> org.apache.derby.jdbc.InternalDriver.connect(InternalDriver.java:231)
> at
> org.apache.derby.jdbc.AutoloadedDriver.connect(AutoloadedDriver.java:119)
> at java.sql.DriverManager.getConnection(DriverManager.java:582)
> at java.sql.DriverManager.getConnection(DriverManager.java:154)
> at
> org.apache.derbyTesting.junit.DriverManagerConnector.getConnectionByAttributes(DriverManagerConnector.java:137)
> at
> org.apache.derbyTesting.junit.DriverManagerConnector.shutEngine(DriverManagerConnector.java:120)
> at
> org.apache.derbyTesting.junit.TestConfiguration.shutdownEngine(TestConfiguration.java:1599)
> at
> org.apache.derbyTesting.junit.SystemPropertyTestSetup.setUp(SystemPropertyTestSetup.java:83)
> at junit.extensions.TestSetup$1.protect(TestSetup.java:20)
> at junit.framework.TestResult.runProtected(TestResult.java:124)
> at junit.extensions.TestSetup.run(TestSetup.java:25)
> at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
> at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
> at junit.framework.TestResult.runProtected(TestResult.java:124)
> at junit.extensions.TestSetup.run(TestSetup.java:25)
> at
> org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
> at junit.framework.TestSuite.runTest(TestSuite.java:230)
> at junit.framework.TestSuite.run(TestSuite.java:225)
> at junit.framework.TestSuite.runTest(TestSuite.java:230)
> at junit.framework.TestSuite.run(TestSuite.java:225)
> at junit.framework.TestSuite.runTest(TestSuite.java:230)
> at junit.framework.TestSuite.run(TestSuite.java:225)
> at junit.textui.TestRunner.doRun(TestRunner.java:121)
> at junit.textui.TestRunner.start(TestRunner.java:185)
> at junit.textui.TestRunner.main(TestRunner.java:143)
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.