[jira] [Comment Edited] (DERBY-4269) Failover did not succeed in 2 min.: testReplication_Local_3_p6_autocommit_OK

Dag H. Wanvik (JIRA) Wed, 13 Jun 2012 08:33:26 -0700

    [ 
https://issues.apache.org/jira/browse/DERBY-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293709#comment-13293709
 ]


Dag H. Wanvik edited comment on DERBY-4269 at 6/13/12 3:30 PM:
---------------------------------------------------------------

Yes, that was my conclusion too. The usage of property lists is extremely 
convoluted in the Derby boot process, and it was very hard to understand what's 
going on. And since this goes on in a forked VM, using a debugger is tricky 
without affecting the timing... I replaced all usages of java.util.Properties 
with a specialized version (DerbyProperties), so I could track all allocations 
and modifications of properties. Adding to the problem is the fact that 
Properties can have default Properties lists recursively, and this is being 
used. In addition Derby adds two layers of specializations: 
UpdateServiceProperties and DoubleProperties.

Mostly, property lists are passed into constructors without any cloning of them 
so the number of live aliases is high.

It turns out the layering at the time of the NPE in the boot is like this:

DoubleProperties#write instance of
   UpdateServiceProperties#defaults instance of
       Properties#defaults instance of
           Properties (no default)   <----- race on this one

The bottom one is the one that's being changed under our feet in these lines in 
EmbedConnection#bootDatabase:

    :
    // clear these values as some modules hang onto                             
     
    // the properties set corresponding to service.properties                   
     
    // and they shouldn't be interested in these JDBC attributes.               
     
    info.clear(); 
    
It is initially allocated in EmbedConnection#filterProperties:

    private Properties filterProperties(Properties inputSet) {                  
             
        Properties limited = new org.apache.derby.iapi.util.DerbyProperties();  
             
        :

assigning it to "info".

The thread that is doing this is the thread that initially started the slave 
operation (cf. URL "startSlave=true"). This thread hangs until the master has 
been started and replication is underway[1]. When the master comes online this 
thread the connect returns with XRE08 ("REPLICATION_SLAVE_STARTED_OK").

The failover happens in this test almost immediately, and the 
SlaveDatabaseBootThread then starts moving again (it i waiting for more log 
records to replay or failover).

Unfortunately, in the current scenario, the thread doing "startSlave=true" is 
still active, normally it would have finished long before 
SlaveDatabaseBootThread starts moving again.

The Property default at the 4th level above would normally be empty, since its 
being cleared by the "startSlave=true" thread in the "info.clear". Bit since 
it's lagging behind, it only does this "clear" when the SlaveDatabaseBootThread 
has started its boot, leading to the contention.


[1] In EmbedConnection.bootDatabase, this corresponds to the line:

// try to start the service if it doesn't already exist
if (!Monitor.startPersistentService(dbname, info)) {
    
 

                
      was (Author: dagw):
    Yes, that was my conclusion too. The usage of property lists is extremely 
convoluted in the Derby boot process, and it was very hard to understand what's 
going on. And since this goes on in a forked VM, using a debugger is tricky 
without affecting the timing... I replaced all usages of java.util.Properties 
with a specialized version (DerbyProperties), so I could track all allocations 
and modifications of properties. Adding to the problem is the fact that 
Properties can have default Properties lists recursively, and this is being 
used. In addition Derby adds two layers of specializations: 
UpdateServiceProperties and DoubleProperties.

Mostly, property lists are passed into constructors without any cloning of them 
so the number of live aliases is high.

It turns out the layering at the time of the NPE in the boot is like this:

DoubleProperties#write instance of
   UpdateServiceProperties#defaults instance of
       Properties#defaults instance of
           Properties (no default)   <----- race on this one

The bottom one is the one that's being changed under our feet in these lines in 
EmbedConnection#bootDatabase:

    :
    // clear these values as some modules hang onto                             
     
    // the properties set corresponding to service.properties                   
     
    // and they shouldn't be interested in these JDBC attributes.               
     
    info.clear(); 
    
It is initially allocated in EmbedConnection#filterProperties:

    private Properties filterProperties(Properties inputSet) {                  
             
        Properties limited = new org.apache.derby.iapi.util.DerbyProperties();  
             
        :

assigning it to "info".

The thread that is doing this is the thread that initially started the slave 
operation (cf. URL "startSlave=true"). This thread hangs until the master has 
been started and replication is underway. When the master comes online this 
thread the connect returns with XRE08 ("REPLICATION_SLAVE_STARTED_OK").

The failover happens in this test almost immediately, and the 
SlaveDatabaseBootThread then starts moving again (it i waiting for more log 
records to replay or failover).

Unfortunately, in the current scenario, the thread doing "startSlave=true" is 
still active, normally it would have finished long before 
SlaveDatabaseBootThread starts moving again.

The Property default at the 4th level above would normally be empty, since its 
being cleared by the "startSlave=true" thread in the "info.clear". Bit since 
it's lagging behind, it only does this "clear" when the SlaveDatabaseBootThread 
has started its boot, leading to the contention.



 

                  
> Failover did not succeed in 2 min.: testReplication_Local_3_p6_autocommit_OK
> ----------------------------------------------------------------------------
>
>                 Key: DERBY-4269
>                 URL: https://issues.apache.org/jira/browse/DERBY-4269
>             Project: Derby
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 10.4.1.3, 10.4.2.0, 10.5.1.1, 10.5.2.0, 10.5.3.0, 
> 10.6.1.0, 10.6.2.1, 10.7.1.1, 10.8.1.2, 10.8.2.2
>         Environment: OS:
> Microsoft© Windows VistaT Ultimate - 6.0.6001 Service Pack 1 - WindowsNT 0 6
> JVM:
> Sun Microsystems Inc. 
> java version "1.4.2_16"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_16-b05)
> Java HotSpot(TM) Client VM (build 1.4.2_16-b05 mixed mode 32-bit)
>            Reporter: Ole Solberg
>            Assignee: Dag H. Wanvik
>              Labels: derby_triage10_5_2
>         Attachments: 4269-client-jstack.txt, 4269-master.txt, 
> 4269-slave-jstack-before-failover.txt, 4269-slave.txt, DERBY-4269.diff, 
> DERBY-4269.stat, DERBY-4269b.diff, db_master-derby.log, 
> db_slave-derby.log.gz, derby-4269-typo.diff
>
>
> Failover did not succeed.
> 2) 
> testReplication_Local_3_p6_autocommit_OK(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p6)junit.framework.AssertionFailedError:
>  Failover did not succeed.
>       at 
> org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.connectPing(ReplicationRun.java:270)
>       at 
> org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p6.derby_3896(ReplicationRun_Local_3_p6.java:200)
>       at 
> org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_3_p6.testReplication_Local_3_p6_autocommit_OK(ReplicationRun_Local_3_p6.java:86)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at 
> org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:106)
>       at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
>       at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
>       at junit.extensions.TestSetup.run(TestSetup.java:25)
> See 
> http://dbtg.thresher.com/derby/test/Daily/jvm1.4/testing/testlog/vista-64/782274-suitesAll_diff.txt

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Comment Edited] (DERBY-4269) Failover did not succeed in 2 min.: testReplication_Local_3_p6_autocommit_OK

Reply via email to