Hi Charith/Shammi/Srinath,

I conducted the test "*MB Cluster pointing to a single cassandra and zk
node -- Test for broker fail over when broker nodes are down*"

*Test environment:*

   - Three-machine setup: M1, M2, M3.
   - Brokers BR1 and BR2 run on M2, while BR3 and BR4 run on M3.
   - An external Cassandra server runs on M3, and an external ZooKeeper
   server runs on M2.
   - Four JMS clients were used for the test: one queue sender on each of
   M1, M2 and M3, and a queue receiver on M1. Client failover is defined in
   the order BR1, BR2, BR3, BR4 (see the connection URL sketch below).
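The failover order is what the clients encode in the brokerlist of the AMQP
connection URL. Below is a minimal sketch of the kind of jndi.properties the
JMS clients would use, assuming the standard andes/Qpid JMS client; the host
names, ports, credentials and queue name (m2host, m3host, 5672/5673,
admin/admin, testQueue) are placeholders, not the exact values from this
setup.

java.naming.factory.initial = org.wso2.andes.jndi.PropertiesFileInitialContextFactory
# Brokers are tried in the listed order: BR1, BR2 (on M2), then BR3, BR4 (on M3)
connectionfactory.qpidConnectionFactory = amqp://admin:admin@clientId/carbon?failover='roundrobin'&brokerlist='tcp://m2host:5672?retries='2';tcp://m2host:5673?retries='2';tcp://m3host:5672?retries='2';tcp://m3host:5673?retries='2''
queue.testQueue = testQueue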



*Test steps and results*


   - Started all 4 brokers. Sent 20 messages from the client at M2 and
   received them using the client at M1. No brokers were killed. *All
   messages received. No exceptions. No duplicates.* (A sketch of the kind
   of JMS sender/receiver client used for these steps is given after the
   test steps.)


   - *Started all 4 brokers. Sent 20 messages to the cluster from the client
   at M2, then killed BR1 and tried to receive using the client at M1
   (according to the failover order, messages should now be received from
   BR2). No messages were received; all messages were lost. The management
   console showed 0 messages. After bringing BR1 back up and running the
   queue receiver again, still no messages were received.*


   - *Repeated the above test. This time only 5 messages were received. The
   management console said 9 messages were remaining, but they could not be
   received using the queue receiver client. Since 20 - (5 + 9) = 6, six
   messages were lost.*


   - *Cleared Cassandra to delete the unreceivable messages and reset the
   setup.*



   - *Sent 20 messages from the client at M1 and 20 messages from the client
   at M2, verified that the management console showed a total of 40, then
   received them using the client at M3. All 40 messages were received. No
   exceptions. No duplicates.*


   - *Failover check: Started all four brokers. Used queue senders at M1 and
   M2 to send 2000 messages each, and the client at M3 to receive messages.
   Activated the receiver after about 500 messages had been sent by each
   client. Killed BR1 at around the 700th message and BR2 at around the
   1400th message. Observed failover on both the sender side and the
   receiver side.*
      - *Receiving started OK.*
      - *When BR1 was killed it did not shut down cleanly, so I closed the
      terminal. Sender-side and receiver-side failover were successful, and
      sending and receiving continued OK.*
      - *After 1500 messages had been sent by each sender client, the
      receiver stopped receiving further messages (it had received only 859
      messages). The senders carried on fine.*
      - *When BR2 was killed, sender-side failover was OK, but the receiver
      still got nothing further.*
      - *The senders went on to send all 2000 messages each. After all
      messages were sent, the management console of BR4 said there were 3482
      messages in the queue.*
      - *Stopped the queue receiver and started it again. Still no messages
      were received. Interestingly, after the queue receiver was restarted
      the message count, which was 3482, immediately became 0 without any
      messages being received.*
      - *Thus the counts (3482 + 859 = 4341) do not add up to the 4000
      messages sent, so there is message loss, and all 3482 messages that
      were supposedly in the queue are completely lost.*



   - Started all brokers. Sent 1000 messages using the client at M1, killing
   BR1 after 500 messages had been sent. Waited until all messages were sent
   and then tried to receive while keeping BR1 down. *Sender failover
   happened smoothly. After all 1000 messages were sent the receiver was
   activated, but no messages were received. The management console of BR4
   said 0 messages, so all messages seem to have been lost.*
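For reference, the senders and the receiver in the steps above were plain JMS
queue clients along the following lines. This is only a minimal sketch
assuming the jndi.properties shown under the test environment; it is not the
actual test client code, and the lookup names and message counts are
placeholders. The receiver counts what it gets, which makes it easy to see
where delivery stops (e.g. after 859 messages).

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;
import java.util.concurrent.atomic.AtomicInteger;

public class FailoverQueueClient {

    public static void main(String[] args) throws Exception {
        // Reads jndi.properties (with the failover brokerlist) from the classpath.
        InitialContext ctx = new InitialContext();
        ConnectionFactory factory = (ConnectionFactory) ctx.lookup("qpidConnectionFactory");
        Queue queue = (Queue) ctx.lookup("testQueue");

        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        if (args.length > 0 && "send".equals(args[0])) {
            // Sender: send 2000 messages and log progress, so a broker can be
            // killed at a known point (e.g. around the 700th message).
            MessageProducer producer = session.createProducer(queue);
            for (int i = 1; i <= 2000; i++) {
                producer.send(session.createTextMessage("message " + i));
                if (i % 100 == 0) {
                    System.out.println("sent " + i);
                }
            }
        } else {
            // Receiver: count messages asynchronously so the exact point where
            // delivery stops after a broker goes down is visible.
            MessageConsumer consumer = session.createConsumer(queue);
            final AtomicInteger received = new AtomicInteger();
            consumer.setMessageListener(new MessageListener() {
                public void onMessage(Message message) {
                    int count = received.incrementAndGet();
                    if (count % 100 == 0) {
                        System.out.println("received " + count);
                    }
                }
            });
            Thread.sleep(10 * 60 * 1000);   // keep listening for 10 minutes
            System.out.println("total received: " + received.get());
        }
        connection.close();
    }
}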


*Exceptions:*

   - Sometimes an exception occurs during startup:
      - me.prettyprint.hector.api.exceptions.HInvalidRequestException:
      InvalidRequestException(why:QpidQueues already exists in keyspace
      QpidKeySpace)



   - Sometimes, when starting two nodes on the same machine, one fails to
   start with the following exception:
      - me.prettyprint.hector.api.exceptions.HInvalidRequestException:
      InvalidRequestException(why:topicSubscribers already exists in keyspace
      QpidKeySpace)

me.prettyprint.hector.api.exceptions.HInvalidRequestException:
InvalidRequestException(why:topicSubscribers already exists in keyspace
QpidKeySpace)
    at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:45)
    at me.prettyprint.cassandra.service.ThriftCluster$4.execute(ThriftCluster.java:105)
    at me.prettyprint.cassandra.service.ThriftCluster$4.execute(ThriftCluster.java:92)
    at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103
    at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:258)
    at me.prettyprint.cassandra.service.ThriftCluster.addColumnFamily(ThriftCluster.java:109)
    at me.prettyprint.cassandra.service.ThriftCluster.addColumnFamily(ThriftCluster.java:84)
    at org.wso2.andes.server.store.util.CassandraDataAccessHelper.createColumnFamily(CassandraDataAccessHelper.java:13
    at org.wso2.andes.server.store.CassandraMessageStore.createKeySpace(CassandraMessageStore.java:590)
    at org.wso2.andes.server.store.CassandraMessageStore.performCommonConfiguration(CassandraMessageStore.java:1489
    at org.wso2.andes.server.store.CassandraMessageStore.configureConfigStore(CassandraMessageStore.java:1605)
    at org.wso2.andes.server.virtualhost.VirtualHostImpl.initialiseMessageStore(VirtualHostImpl.java:407)
    at org.wso2.andes.server.virtualhost.VirtualHostImpl.<init>(VirtualHostImpl.java:235)
    at org.wso2.andes.server.virtualhost.VirtualHostImpl.<init>(VirtualHostImpl.java:171)
    at org.wso2.andes.server.registry.ApplicationRegistry.createVirtualHost(ApplicationRegistry.java:566)
    at org.wso2.andes.server.registry.ApplicationRegistry.initialiseVirtualHosts(ApplicationRegistry.java:325)
    at org.wso2.andes.server.registry.ApplicationRegistry.initialise(ApplicationRegistry.java:263)
    at org.wso2.andes.server.registry.ApplicationRegistry.initialise(ApplicationRegistry.java:149)
    at org.wso2.andes.server.Broker.startupImpl(Broker.java:142)
    at org.wso2.andes.server.Broker.startup(Broker.java:102)
    at org.wso2.andes.server.Main.startBroker(Main.java:227)
    at org.wso2.andes.server.Main.execute(Main.java:220)
    at org.wso2.andes.server.Main.<init>(Main.java:63)
    at org.wso2.andes.server.Main.main(Main.java:53)
    at org.wso2.carbon.andes.internal.QpidServiceComponent.activate(QpidServiceComponent.java:183)
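The "already exists in keyspace QpidKeySpace" failures above happen when a
broker tries to re-create column families that another node (or a previous
run) has already created. Below is a minimal sketch of the kind of defensive
check that avoids this with the Hector API; it is only an illustration of the
idea, not a patch against CassandraDataAccessHelper, and the class, method and
parameter names here are placeholders.

import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.ddl.ColumnFamilyDefinition;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.factory.HFactory;

public class ColumnFamilyCreator {

    // Create the column family only if it is not already defined, so that a
    // second broker starting against the same Cassandra server does not fail
    // with HInvalidRequestException (... already exists in keyspace ...).
    public static void createColumnFamilyIfAbsent(Cluster cluster, String keyspace, String cfName) {
        KeyspaceDefinition keyspaceDef = cluster.describeKeyspace(keyspace);
        if (keyspaceDef != null) {
            for (ColumnFamilyDefinition existing : keyspaceDef.getCfDefs()) {
                if (existing.getName().equals(cfName)) {
                    return;   // already created by another node, nothing to do
                }
            }
        }
        ColumnFamilyDefinition cfDef = HFactory.createColumnFamilyDefinition(keyspace, cfName);
        cluster.addColumnFamily(cfDef);
    }
}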


*Notes*


   - There are problems when nodes are shut down abruptly.
   - We have to look into why the message count sometimes suddenly becomes 0
   when the receive client is activated.
   - We have to look into why further messages are not received even when
   the management console says messages are still in the queue.
   - There is some problem with sending messages and receiving them from JMS
   clients on different machines when failover is involved.
   - I believe there is no receiving problem in the client code, as I have
   checked it against the MB 1.0.2 release and message receiving works fine
   there.



On Fri, Jun 15, 2012 at 10:00 AM, Charith Wickramarachchi
<[email protected]>wrote:

> I'd like to suggest the following breakdown of the test scenarios, to be
> done in the following order:
>
> *Broker level*
>
> 1) MB Cluster pointing to a single cassandra and zk node -- Test for
> broker fail over when broker nodes are down
>
> 2) MB Cluster pointing to a cassandra cluster and a zk node -- Test for
> broker fail over when broker nodes are down
>
> 3) MB Cluster pointing to a cassandra cluster and zk cluster node -- Test
> for broker fail over when broker nodes are down
>
>
> *Cassandra/Zk level *
> 4) MB Cluster pointing to a  cassandra cluster and zk cluster  -- Test for
> fail over when cassandra nodes are down (use correct replication factors)
>
> 5) MB Cluster pointing to a  cassandra cluster and zk cluster  -- Test for
> fail over when zk nodes are down (use correct replication factors , test
> for new queue creations too )
>
>
> *Hybrid*
>
> 6) MB cluster created with internal Cassandra and zk clusters - Test for
> broker fail over (which means when broker is down internal zk and cassandra
> nodes will also be not available )
>
> cheers,
> Charith
>
>
> On Fri, Jun 15, 2012 at 9:39 AM, Charith Wickramarachchi <[email protected]
> > wrote:
>
>> Hi Hasitha ,
>>
>> Can you try pointing to a separate Cassandra ring, instead of using the
>> internal ones and killing the seed Cassandra nodes along with the broker?
>>
>> Basically, the 1st scenario we need to get working is failover when
>> killing broker nodes one by one. We do not need to bring Cassandra-level
>> failover into the picture initially, since then we have too many
>> variables. After we get this scenario to work we can look at
>> Cassandra-level failover.
>>
>> I think the best way to go through this test-and-fix session is to do it
>> iteratively, level by level. So let's first test broker-level failover and
>> make sure it works. If there are issues let's fix them and then go to the
>> next level; otherwise it's hard to isolate the issues.
>>
>> cheers,
>> Charith
>>
>> On Fri, Jun 15, 2012 at 9:27 AM, Hasitha Hiranya <[email protected]>wrote:
>>
>>> Hi,
>>>
>>> I did a careful test on the MB2 pack with the deployment pattern "*Use
>>> inbuilt cassandra server and zoo keeper server for all the broker nodes*".
>>>
>>> Following are results step by step.
>>>
>>> *Environment.*
>>>
>>>    - Three machines M1,M2,M3.
>>>    - Three broker nodes BR1,BR2,BR3 (one per machine).
>>>    - Named cassandra instances in BR2,BR3 as seeds.
>>>    - JMS queue senders at M1,M2
>>>    - JMS queue receiver at M3
>>>
>>> *Test steps and results*.
>>>
>>>
>>>    - Sent 20 messages to cluster using client at M1 and received using
>>>    client at M3. *All messages received. No exceptions. No duplication
>>>    of messages.*
>>>
>>>
>>>    - Sent 20 messages to BR1, and killed it. When BR1 is killed, the
>>>    other servers keep trying to reach the ZooKeeper connection on the
>>>    killed node, printing a log entry for each attempt (this inflates the
>>>    log size). Then ran the JMS client at node M3. No messages were
>>>    received. The management console showed 0 messages, so all 20 messages
>>>    were lost.
>>>
>>>
>>>
>>>    - Then started BR1 again. Exceptions from BR2,BR3 stopped. Now ran
>>>       queue receiver again. No messages received.
>>>
>>>
>>>
>>>    - Killed *seed node* BR2. The others reported BR2 as dead and removed
>>>    it from gossip. ZooKeeper leader election ran round the ring and
>>>    confirmed the leader. Interestingly, killing seed node BR2 did not
>>>    produce exceptions at BR3 (seed) or BR1 (non-seed), and there were no
>>>    ZooKeeper connection-refused exceptions either.
>>>
>>>
>>>
>>>    - Began the test again with all of BR1, BR2, BR3 up. Killed BR1 and
>>>    sent 20 messages (failover should now detect that BR2 is up). But now
>>>    BR2 and BR3 print continuous connection-refused errors for the
>>>    ZooKeeper connection at BR1 [1]. The JMS client also says connection
>>>    refused (meaning BR2 or BR3 is not responding, or the client does not
>>>    detect that they are up).
>>>
>>> *Exceptions*
>>> [1] java.net.ConnectException: Connection refused
>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>>>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1143)
>>> [2012-06-15 08:52:04,998] INFO {org.apache.zookeeper.ClientCnxn} - Opening
>>> socket connection to server nodex/192.168.0.100:2181
>>> [2012-06-15 08:52:05,000] WARN {org.apache.zookeeper.ClientCnxn} - Session
>>> 0x137ee21ac480000 for server null, unexpected error, closing socket
>>> connection and attempting reconnect
>>>
>>> *Issues*
>>>
>>>
>>>    - When a queue (which is distributed) is deleted, exceptions occur.
>>>    - The nodes list only shows the local node.
>>>    - A large number of logs are printed when starting as a cluster and
>>>    while seeking connections.
>>>    - In order to create a binding for the queue, the queue listener has
>>>    to run before the queue sender, which is not acceptable.
>>>
>>> I will file some JIRAs.
>>>
>>> Thanks.
>>>
>>>
>>> On Mon, Jun 11, 2012 at 5:08 PM, Hasitha Hiranya <[email protected]>wrote:
>>>
>>>> Hi Srinath, Charith, Shammi,
>>>>
>>>> I did HA tests for MB M2 pack. Results can be found at
>>>>
>>>> https://docs.google.com/a/wso2.com/spreadsheet/ccc?key=0Ap7HoxWKqqNndEZFRzNpNW1wOGlGckUtcUhBTzlUTkE#gid=0
>>>>
>>>> Please note that test results, exceptions occurred, sent and received
>>>> message details are at different sheets.
>>>>
>>>> Thanks.
>>>>
>>>> --
>>>> *Hasitha Abeykoon*
>>>> Software Engineer; WSO2, Inc.; http://wso2.com
>>>> *cell:* *+94 719363063*
>>>> *blog:* *abeykoon.blogspot.com* <http://abeykoon.blogspot.com>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Hasitha Abeykoon*
>>> Software Engineer; WSO2, Inc.; http://wso2.com
>>> *cell:* *+94 719363063*
>>> *blog:* *abeykoon.blogspot.com* <http://abeykoon.blogspot.com>
>>>
>>>
>>
>>
>> --
>> Charith Dhanushka Wickramarachchi
>> Senior Software Engineer
>> WSO2 Inc
>> http://wso2.com/
>> http://wso2.org/
>>
>> blog
>> http://charithwiki.blogspot.com/
>>
>> twitter
>> http://twitter.com/charithwiki
>>
>> Mobile : 0776706568
>>
>>
>>
>
>
> --
> Charith Dhanushka Wickramarachchi
> Senior Software Engineer
> WSO2 Inc
> http://wso2.com/
> http://wso2.org/
>
> blog
> http://charithwiki.blogspot.com/
>
> twitter
> http://twitter.com/charithwiki
>
> Mobile : 0776706568
>
>
>


-- 
*Hasitha Abeykoon*
Software Engineer; WSO2, Inc.; http://wso2.com
*cell:* *+94 719363063*
*blog:* *abeykoon.blogspot.com* <http://abeykoon.blogspot.com>