[
https://issues.apache.org/jira/browse/AMQ-5082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Scott Feldstein updated AMQ-5082:
---------------------------------
Attachment: 03-07.tgz
I've been doing some more digging on this. It turns out that a job on another
VM on the same ESX box runs at exactly the moment things fail. That job backs
up roughly 60GB to S3. At that point there is a big hiccup in the system, all
the ActiveMQ nodes stop listening, and they never recover. I tried increasing
the ZooKeeper timeout from 2s to 10s on the ActiveMQ nodes, but that still
doesn't help.
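For reference, the timeout change was along these lines (assuming the
zkSessionTimeout attribute of the replicatedLevelDB adapter; the other
attributes mirror the configuration quoted in the issue description below):
{code}
<persistenceAdapter>
  <!-- zkSessionTimeout raised from the 2s default to 10s; everything else
       is unchanged from the original configuration -->
  <replicatedLevelDB
    directory="${activemq.data}/leveldb"
    replicas="3"
    bind="tcp://0.0.0.0:0"
    zkAddress="zookeep0:2181"
    zkPath="/activemq/leveldb-stores"
    zkSessionTimeout="10s"/>
</persistenceAdapter>
{code}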
I think the bug here is that the nodes never recover from this outage. It
seems like there should be some kind of recovery logic for the case where all
the nodes lose their connection and then come back to life at some later point
in time.
I am attaching more logs from 03/07/14 with DEBUG enabled.
This set of logs shows the behavior I am describing. Starting at
04:52:22,366 the nodes enter a state in which no master is ever elected
("elected" stays null for every group member):
{code}
$ egrep '"elected":' mq-node1-activemq.log.bug
2014-03-07 04:50:49,325 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000169,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":"0000000171"}),
(0000000171,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:58506","position":-1,"weight":1,"elected":null})))
2014-03-07 04:50:54,453 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000169,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":"0000000171"}),
(0000000171,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:58506","position":-1,"weight":1,"elected":null}),
(0000000173,{"id":"localhost","container":null,"address":null,"position":161531,"weight":1,"elected":null})))
2014-03-07 04:51:06,008 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000169,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":"0000000171"}),
(0000000171,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:58506","position":-1,"weight":1,"elected":null})))
2014-03-07 04:51:42,386 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000169,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":"0000000171"}),
(0000000171,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:52:22,366 DEBUG [main]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null})))
2014-03-07 04:52:24,160 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:52:24,167 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000176,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:52:58,259 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000176,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000177,{"id":"localhost","container":null,"address":null,"position":161531,"weight":1,"elected":null})))
2014-03-07 04:55:12,679 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000176,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:55:12,681 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 04:56:02,014 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000178,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 05:00:20,550 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null})))
2014-03-07 05:01:32,446 DEBUG [main-EventThread]
[org.apache.activemq.leveldb.replicated.MasterElector@112] ZooKeeper group
changed: Map(localhost ->
ListBuffer((0000000174,{"id":"localhost","container":null,"address":"tcp://10.1.1.218:49168","position":-1,"weight":1,"elected":null}),
(0000000175,{"id":"localhost","container":null,"address":null,"position":161609,"weight":1,"elected":null}),
(0000000179,{"id":"localhost","container":null,"address":null,"position":161531,"weight":1,"elected":null})))
{code}
> ActiveMQ replicatedLevelDB cluster breaks, all nodes stop listening
> -------------------------------------------------------------------
>
> Key: AMQ-5082
> URL: https://issues.apache.org/jira/browse/AMQ-5082
> Project: ActiveMQ
> Issue Type: Bug
> Components: activemq-leveldb-store
> Affects Versions: 5.9.0, 5.10.0
> Reporter: Scott Feldstein
> Priority: Critical
> Attachments: 03-07.tgz, mq-node1-cluster.failure,
> mq-node2-cluster.failure, mq-node3-cluster.failure,
> zookeeper.out-cluster.failure
>
>
> I have a 3-node ActiveMQ cluster and one ZooKeeper node using a
> replicatedLevelDB persistence adapter.
> {code}
> <persistenceAdapter>
>   <replicatedLevelDB
>     directory="${activemq.data}/leveldb"
>     replicas="3"
>     bind="tcp://0.0.0.0:0"
>     zkAddress="zookeep0:2181"
>     zkPath="/activemq/leveldb-stores"/>
> </persistenceAdapter>
> {code}
> After about a day or so of sitting idle there are cascading failures and the
> cluster stops listening altogether.
> I can reproduce this consistently on 5.9 and the latest 5.10 (commit
> 2360fb859694bacac1e48092e53a56b388e1d2f0). I am going to attach logs from
> the three ActiveMQ nodes and the ZooKeeper logs covering the time when the
> cluster starts having issues.
> The cluster stops listening at Mar 4, 2014 4:56:50 AM (all nodes stop within
> about 5 seconds of each other).
> The OSs are all CentOS 5.9 on one ESX server, so I doubt networking is an
> issue.
> If you need more data it should be pretty easy to get whatever is needed
> since it is consistently reproducible.
> This bug may be related to AMQ-5026, but looks different enough to file a
> separate issue.
--
This message was sent by Atlassian JIRA
(v6.2#6252)