> On Mar 8, 2019, at 1:42 AM, Robert Varga <n...@hq.sk> wrote:
>
> On 08/03/2019 09:05, Luis Gomez wrote:
>> Hi controller experts,
>>
>> I sporadically see this problem in one of our cluster deployments (Fluorine SR1):
>>
>> 1) The operational DS gets locked in member-1 (the leader of shard-default-operational) after stopping member-2:
>>
>> 2019-03-01T14:49:05,189 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | RpcRegistry | 225 - org.opendaylight.controller.sal-clustering-commons - 1.8.1 | Actor termination Terminated(Actor[akka.tcp://opendaylight-cluster-data@172.18.1.12:2550/user/rpc/broker#-938891863]) received
>> ...
>> 2019-03-01T14:49:16,900 | WARN | epollEventLoopGroup-11-7 | TransactionContextWrapper | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | Failed to acquire enqueue operation permit for transaction member-1-datastore-operational-fe-8-chn-4-txn-23-0 on shard default
>> 2019-03-01T14:49:16,901 | WARN | epollEventLoopGroup-11-8 | TransactionContextWrapper | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | Failed to acquire enqueue operation permit for transaction member-1-datastore-operational-fe-8-chn-5-txn-24-0 on shard default
>> …
>> 2019-03-01T14:50:06,051 | WARN | epollEventLoopGroup-11-4 | TransactionContextWrapper | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | Failed to acquire enqueue operation permit for transaction member-1-datastore-operational-fe-8-chn-21-txn-2-0 on shard default
>
> This is backpressure being applied on the frontend ...
>
>> 2019-03-01T14:50:06,385 | WARN | epollEventLoopGroup-11-2 | TransactionContextWrapper | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | Failed to acquire enqueue operation permit for transaction member-1-datastore-operational-fe-8-chn-32-txn-2-0 on shard default
>>
>> 2) A few seconds later I see a DS transaction getting stuck, which may explain why the operational DS is locked:
>>
>> 2019-03-01T14:49:24,032 | WARN | opendaylight-cluster-data-shard-dispatcher-70 | ShardDataTree | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | member-1-shard-default-operational: Current transaction member-1-datastore-operational-fe-8-txn-1101-0 has timed out after 15707 ms in state COMMIT_PENDING
>> 2019-03-01T14:49:24,033 | WARN | opendaylight-cluster-data-shard-dispatcher-70 | ShardDataTree | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | member-1-shard-default-operational: Transaction member-1-datastore-operational-fe-8-txn-1101-0 is still committing, cannot abort
>> 2019-03-01T14:49:38,333 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-46 | LocalThreePhaseCommitCohort | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | Failed to prepare transaction member-1-datastore-operational-fe-8-txn-1101-0 on backend
>> 2019-03-01T14:49:44,032 | WARN | opendaylight-cluster-data-shard-dispatcher-36 | ShardDataTree | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | member-1-shard-default-operational: Current transaction member-1-datastore-operational-fe-8-txn-1101-0 has timed out after 20000 ms in state COMMIT_PENDING
>> 2019-03-01T14:49:44,033 | WARN | opendaylight-cluster-data-shard-dispatcher-36 | ShardDataTree | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | member-1-shard-default-operational: Transaction member-1-datastore-operational-fe-8-txn-1101-0 is still committing, cannot abort
>> 2019-03-01T14:49:59,033 | WARN | opendaylight-cluster-data-shard-dispatcher-59 | ShardDataTree | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | member-1-shard-default-operational: Current transaction member-1-datastore-operational-fe-8-txn-1101-0 has timed out after 15000 ms in state COMMIT_PENDING
>> 2019-03-01T14:49:59,033 | WARN | opendaylight-cluster-data-shard-dispatcher-59 | ShardDataTree | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | member-1-shard-default-operational: Transaction member-1-datastore-operational-fe-8-txn-1101-0 is still committing, cannot abort
>> 2019-03-01T14:50:14,033 | WARN | opendaylight-cluster-data-shard-dispatcher-58 | ShardDataTree | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | member-1-shard-default-operational: Current transaction member-1-datastore-operational-fe-8-txn-1101-0 has timed out after 15000 ms in state COMMIT_PENDING
>> 2019-03-01T14:50:14,033 | WARN | opendaylight-cluster-data-shard-dispatcher-58 | ShardDataTree | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | member-1-shard-default-operational: Transaction member-1-datastore-operational-fe-8-txn-1101-0 is still committing, cannot abort
>
> ... due to the backend not making progress. The transaction is waiting for RAFT consensus ...
>
>> 3) The DS lock seems to clear after member-2 comes back:
>>
>> 2019-03-01T14:50:10,753 | INFO | opendaylight-cluster-data-shard-dispatcher-29 | ShardInformation | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | updatePeerAddress for peer member-2-shard-default-operational with address akka.tcp://opendaylight-cluster-data@172.18.1.12:2550/user/shardmanager-operational/member-2-shard-default-operational
>> 2019-03-01T14:50:10,754 | INFO | opendaylight-cluster-data-shard-dispatcher-89 | Shard | 225 - org.opendaylight.controller.sal-clustering-commons - 1.8.1 | Peer address for peer member-2-shard-default-operational set to akka.tcp://opendaylight-cluster-data@172.18.1.12:2550/user/shardmanager-operational/member-2-shard-default-operational
>> 2019-03-01T14:50:10,754 | INFO | opendaylight-cluster-data-shard-dispatcher-29 | ShardManager | 233 - org.opendaylight.controller.sal-distributed-datastore - 1.8.1 | shard-manager-operational: All Shards are ready - data store operational is ready, available count is 0
>
> ... which is reached when member-2 comes back. What was happening on the third member?
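For reference, the COMMIT_PENDING warnings above appear to be governed by the distributed datastore's timeout knobs. A minimal sketch of where those live, assuming the Fluorine-era property names (verify against the commented defaults shipped in your distribution before relying on them):

    # etc/org.opendaylight.controller.cluster.datastore.cfg
    # Believed to bound how long a three-phase commit may sit on the shard
    # before it is considered timed out ("timed out ... in state COMMIT_PENDING").
    shard-transaction-commit-timeout-in-seconds=30
    # Believed to bound how long the frontend waits for an operation, e.g.
    # acquiring an enqueue permit, before warning and backing off.
    operation-timeout-in-seconds=5

Raising these values only hides the symptom: the commit is waiting on RAFT consensus, and with three members a single failure should still leave a majority, which is why the health of the third member matters here.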
Thanks Robert. member-3 is fine. After your comment on backpressure, I think I can explain what is going on here: after member-2 goes down, member-1 becomes the owner of an application that gets reinitialized and writes a lot of data into the DS. The time member-2 comes back is roughly the time the application finishes writing that data, which explains why things go back to normal by then.

>> Any idea how to debug/troubleshoot this problem?
>
> look for trouble with member-3, debugs at sal-akka-raft :)
>
> Regards,
> Robert
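For anyone following along, the sal-akka-raft debugs Robert suggests can be enabled at runtime from the Karaf console; a minimal sketch, assuming the RAFT implementation logs under the org.opendaylight.controller.cluster.raft package (adjust the logger name if your version differs):

    # Karaf console: raise the RAFT logger to DEBUG at runtime
    log:set DEBUG org.opendaylight.controller.cluster.raft

    # or persist it across restarts in etc/org.ops4j.pax.logging.cfg (log4j2 syntax);
    # "odlraft" is just an arbitrary logger id chosen for this example
    log4j2.logger.odlraft.name = org.opendaylight.controller.cluster.raft
    log4j2.logger.odlraft.level = DEBUG

With DEBUG enabled, the shard leader on member-1 should log its per-follower replication progress, which should show whether member-3 keeps acknowledging entries while member-2 is down.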