Dag,

Thanks for analyzing and fixing this strange issue! Stopping replication before the startSlave command had completed was never on my mind :-/

I had a look at your patch, though, and I think you can fix this bug with even less code.

From SlaveDatabase.java:86:
    /** Set by the database boot thread if it fails before slave mode
     * has been started properly (i.e., if inBoot is true). This
     * exception will then be reported to the client connection. */
    private volatile StandardException bootException;

bootException is only set in one place - SlaveDatabase#handleShutdown. There you'll also see the reason for the limbo state that made the tests fail: if an exception makes the slave replication code call handleShutdown while booting is in progress, the database is supposed to be shut down by the client thread when it receives an exception from SlaveDatabase.boot().

As you already found out, that didn't happen because bootException was set during the 500 ms wait in verifySuccessfulBoot. However, this should apply to any exception in bootException, not only DATABASE_SEVERITY ones (although I *think* only DB severity exceptions will be reported here).

I would go with the same check that is inside the while loop. Thus, instead of

+        if (bootException != null &&
+            SQLState.SHUTDOWN_DATABASE.startsWith(
+                bootException.getSQLState()) &&
+            bootException.getSeverity() == ExceptionSeverity.DATABASE_SEVERITY) {

use

+        if (bootException != null)
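
To illustrate why the unconditional check after the loop is enough, here is a minimal sketch of the handshake as I described it above. It is a simplified, hypothetical model and not the Derby source: the class name, the `started` and `inBoot` flags, `markStarted` and the `pollMillis` parameter are assumptions made for the example, and plain `Exception` stands in for StandardException.

```java
/**
 * Simplified, hypothetical model of the boot handshake described above.
 * This is NOT the Derby source; names mirror the message for readability.
 */
class BootHandshakeSketch {
    /** Set by the boot thread if it fails while boot is still in progress. */
    private volatile Exception bootException;
    /** Assumed flag: true until the client thread has verified the boot. */
    private volatile boolean inBoot = true;
    /** Assumed flag: true once slave mode reports that it has started. */
    private volatile boolean started;

    /** Called by the replication code when slave mode has come up. */
    void markStarted() {
        started = true;
    }

    /** Called by the replication code on failure; records it during boot. */
    void handleShutdown(Exception cause) {
        if (inBoot) {
            bootException = cause;
        }
    }

    /** Client thread: wait for slave mode, then report any boot failure. */
    void verifySuccessfulBoot(long pollMillis) throws Exception {
        while (!started) {
            if (bootException != null) {
                throw bootException;
            }
            Thread.sleep(pollMillis);
        }
        // The failure may have been recorded during the last sleep, after
        // which the loop exits without re-checking. Check once more, for
        // any exception, regardless of SQLState or severity.
        if (bootException != null) {
            throw bootException;
        }
        inBoot = false;
    }

    public static void main(String[] args) {
        BootHandshakeSketch db = new BootHandshakeSketch();
        // Slave mode starts, then fails before the client thread looks again.
        db.markStarted();
        db.handleShutdown(new Exception("boot failed"));
        try {
            db.verifySuccessfulBoot(10L);
            System.out.println("boot verified");
        } catch (Exception e) {
            System.out.println("reported: " + e.getMessage());
        }
    }
}
```

Without the check after the loop, the call in main would return normally and the recorded exception would be lost, which is the limbo state seen in the test.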


Dag H. Wanvik (JIRA) wrote:
     [ https://issues.apache.org/jira/browse/DERBY-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dag H. Wanvik updated DERBY-4186:
---------------------------------

    Attachment: derby-4186-2.stat
                derby-4186-2.diff

Respin of this patch, #2, with more comments. I also talked to the author of 
this code off-line, Jørgen Løland, and he agreed with my analysis. The new 
patch moves the check for the lost exception to inside the method 
SlaveDatabase.verifySuccessfulBoot.
Added more explanations in the comments.



After failover, test fails when it succeeds in connecting early to failed over 
slave
------------------------------------------------------------------------------------

                Key: DERBY-4186
                URL: https://issues.apache.org/jira/browse/DERBY-4186
            Project: Derby
         Issue Type: Bug
         Components: Replication, Test
   Affects Versions: 10.6.0.0
           Reporter: Dag H. Wanvik
           Assignee: Dag H. Wanvik
        Attachments: bad-slave.txt, derby-4186-2.diff, derby-4186-2.stat, 
derby-4186.diff, derby-4186.stat, ok-slave.txt


Occasionally I see this error in ReplicationRun_Local_3_p3:
1) testReplication_Local_3_p3_StateNegativeTests(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p3)junit.framework.AssertionFailedError: Expected SQLState'08004', but got connection!
        at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.waitForSQLState(ReplicationRun.java:332)
        at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p3.testReplication_Local_3_p3_StateNegativeTests(ReplicationRun_Local_3_p3.java:170)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
        at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
        at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
        at junit.extensions.TestSetup.run(TestSetup.java:25)
In the code, after a stopMaster is given to the master (which should lead to 
failover), the test expects to see CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE 
(08004.C.7), which will only succeed if the test gets to try to connect before 
the failover has started. This seems wrong. If the failover has completed, it 
should expect a successful connect (which boots the database, btw, since it's 
shut down after a successful failover).
Quote from code:
waitForSQLState("08004", 100L, 20, // 08004.C.7 - CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE
                slaveDatabasePath + FS + slaveDbSubPath + FS + replicatedDb,
                slaveServerHost, slaveServerPort); // _failOver above fails...
There is a race between the failover on the slave and the test here, I think.



--
Jørgen Løland
