Dag,
Thanks for analyzing and fixing this strange issue! Stopping replication
before the startSlave command had completed was never on my mind :-/
I had a look at your patch though, and I think you can fix this bug with
even less code.
From SlaveDatabase.java:86:
    /** Set by the database boot thread if it fails before slave mode
     *  has been started properly (i.e., if inBoot is true). This
     *  exception will then be reported to the client connection. */
    private volatile StandardException bootException;
bootException is only set in one place - SlaveDatabase#handleShutdown.
There you'll also see the reason for the limbo state that made the tests
fail: if an exception makes the slave replication code call
handleShutdown while booting is in progress, the database is supposed to
be shut down by the client thread when it receives an exception from
SlaveDatabase.boot().
As you already found out, that didn't happen because the bootException
was set during the 500 ms wait in verifySuccessfulBoot. However,
this should apply to any exception in bootException, not only
DATABASE_SEVERITY ones (although I *think* only DB severity exceptions
will be reported here).
I would go with the same check that is used inside the while loop. Thus,
instead of
+ if (bootException != null &&
+     SQLState.SHUTDOWN_DATABASE.startsWith(
+         bootException.getSQLState()) &&
+     bootException.getSeverity() ==
+         ExceptionSeverity.DATABASE_SEVERITY) {
use
+ if (bootException != null) {
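To illustrate the interleaving, here is a minimal, self-contained sketch. It
is not the actual Derby code: BootCheckSketch, BootFailure, failDuringBoot,
markStarted and the waitStep parameter are hypothetical names, BootFailure
stands in for StandardException, and waitStep stands in for the 500 ms sleep
so the race can be replayed deterministically. It assumes the loop condition
can become true during the same wait in which bootException is set, which is
how I read the failure.

```java
final class BootCheckSketch {
    /** Hypothetical stand-in for Derby's StandardException. */
    static class BootFailure extends Exception {
        BootFailure(String message) { super(message); }
    }

    /** Set by the (simulated) boot thread if booting fails. */
    private volatile BootFailure bootException;
    /** Hypothetical flag standing in for "slave mode has been started". */
    private volatile boolean started;

    /** Models handleShutdown being called while boot is in progress. */
    void failDuringBoot(String reason) {
        bootException = new BootFailure(reason);
        started = true; // assumption: the loop below may now terminate
    }

    /** Models a boot that completes without problems. */
    void markStarted() {
        started = true;
    }

    /** waitStep replaces the real sleep so a test can inject the race. */
    void verifySuccessfulBoot(Runnable waitStep) throws BootFailure {
        while (!started) {
            if (bootException != null) {
                throw bootException;
            }
            waitStep.run();
        }
        // The loop may exit right after the wait without re-checking,
        // so any exception set during that wait is reported here.
        if (bootException != null) {
            throw bootException;
        }
    }

    public static void main(String[] args) {
        BootCheckSketch sketch = new BootCheckSketch();
        try {
            sketch.verifySuccessfulBoot(
                () -> sketch.failDuringBoot("shutdown during boot"));
            System.out.println("boot verified");
        } catch (BootFailure e) {
            System.out.println("boot failed: " + e.getMessage());
        }
    }
}
```

Without the check after the loop, the call in main would return normally and
the exception would be lost, which matches the limbo state described above.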
Dag H. Wanvik (JIRA) wrote:
[ https://issues.apache.org/jira/browse/DERBY-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dag H. Wanvik updated DERBY-4186:
---------------------------------
Attachment: derby-4186-2.stat
derby-4186-2.diff
Respin of this patch, #2, with more comments. I also talked to the author of
this code off-line, Jørgen Løland, and he agreed with my analysis. The new
patch moves the check for the lost exception to inside the method
SlaveDatabase.verifySuccessfulBoot.
Added more explanations in the comments.
After failover, test fails when it succeeds in connecting early to failed over
slave
------------------------------------------------------------------------------------
Key: DERBY-4186
URL: https://issues.apache.org/jira/browse/DERBY-4186
Project: Derby
Issue Type: Bug
Components: Replication, Test
Affects Versions: 10.6.0.0
Reporter: Dag H. Wanvik
Assignee: Dag H. Wanvik
Attachments: bad-slave.txt, derby-4186-2.diff, derby-4186-2.stat,
derby-4186.diff, derby-4186.stat, ok-slave.txt
Occasionally I see this error in ReplicationRun_Local_3_p3:
1) testReplication_Local_3_p3_StateNegativeTests(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p3)junit.framework.AssertionFailedError: Expected SQLState'08004', but got connection!
    at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.waitForSQLState(ReplicationRun.java:332)
    at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p3.testReplication_Local_3_p3_StateNegativeTests(ReplicationRun_Local_3_p3.java:170)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
    at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
    at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
    at junit.extensions.TestSetup.run(TestSetup.java:25)
In the code, after a stopMaster is given to the master (which should lead to
fail-over), the test expects to see CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE
(08004.C.7), which will only succeed if the test gets to try to connect
before the failover has started. This seems wrong. If the failover has
completed, it should expect a successful connect (which boots the database,
btw, since it's shut down after a successful failover).
Quote from code:
    waitForSQLState("08004", 100L, 20, // 08004.C.7 - CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE
            slaveDatabasePath + FS + slaveDbSubPath + FS + replicatedDb,
            slaveServerHost, slaveServerPort); // _failOver above fails...
There is a race between the failover on the slave and the test here, I think.
--
Jørgen Løland