Sorry for the delayed response, Srini. We have not tried the master branch (Nitrogen / Akka 2.5). I am not sure whether such an issue would go away with Akka 2.5, because the circuit-breaker timeout is primarily related to the LevelDB plugin.
Over about 20 days, we have not been able to reproduce this issue consistently; we have seen it only once, on one of the cluster nodes. We are using plain Ubuntu VMs to bring up the cluster. ‘dmesg’ also did not show any disk-related issues. Some “theoretical” candidates we started to suspect are:
a) Compaction of LevelDB colliding with incoming writes, i.e. heavy compaction delaying the incoming writes in LevelDB.
b) A difference between the VM's disk performance and the host's disk performance. We may have to rule this out by running heavy ‘dd’ writes on the disks to see whether disk writes are slow at a different level altogether.
We have yet to confirm either.
Consulting the Akka Google Groups did not help either, because most of the recommendations are that LevelDB is not meant for production. But even if we used Cassandra (for example, for journal persistence), a timeout would logically still be possible; perhaps the Cassandra plugin handles such cases better.

Regards
Muthu

From: srini...@gmail.com [mailto:srini...@gmail.com] On Behalf Of Srini Seetharaman
Sent: Saturday, August 12, 2017 1:55 AM
To: Muthukumaran K
Cc: Tom Pantelis; controller-dev@lists.opendaylight.org
Subject: Re: [controller-dev] Circuit Breaker timed out

Or was there a real disk issue on that machine you were using?

On Fri, Aug 11, 2017 at 10:58 AM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:
Muthu,
It's worrisome to hear that you've seen this too. Did it go away with Nitrogen or with moving to Akka 2.5 persistence?
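As a rough complement to the heavy ‘dd’ test suggested above, the following is a hypothetical Java probe (the class name, block size and write count are arbitrary choices, not anything from ODL or Akka). It times 4 KB writes that are each forced to disk with FileChannel.force and reports the worst one next to the 10s default call-timeout. It only approximates a synced journal append; it does not exercise LevelDB itself, so it says nothing about hypothesis (a) on compaction.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.Duration;

public class FsyncLatencyProbe {

    /** Writes `count` blocks of `blockSize` zero bytes, forcing each to disk; returns the slowest write+force. */
    static Duration worstSyncedWrite(Path file, int blockSize, int count) throws IOException {
        ByteBuffer block = ByteBuffer.allocate(blockSize);
        long worstNanos = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < count; i++) {
                block.clear();
                long start = System.nanoTime();
                while (block.hasRemaining()) {
                    ch.write(block);
                }
                ch.force(false); // ask the OS to flush the file contents to the storage device
                worstNanos = Math.max(worstNanos, System.nanoTime() - start);
            }
        }
        return Duration.ofNanos(worstNanos);
    }

    public static void main(String[] args) throws IOException {
        // Pass a directory on the same disk as the journal; defaults to the JVM temp dir.
        Path dir = Path.of(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        Path probe = Files.createTempFile(dir, "fsync-probe", ".bin");
        try {
            Duration worst = worstSyncedWrite(probe, 4096, 1000);
            Duration callTimeout = Duration.ofSeconds(10); // default call-timeout quoted in this thread
            System.out.printf("worst write+fsync: %d ms (call-timeout: %d ms)%n",
                    worst.toMillis(), callTimeout.toMillis());
        } finally {
            Files.deleteIfExists(probe);
        }
    }
}
```

Run it with a directory on the same disk as the LevelDB journal as its only argument. A worst case far below 10s would make raw disk latency a less likely sole explanation, though a single run can easily miss a transient stall.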
I am referring to the following params within the persistence section of akka.conf:

circuit-breaker {
  max-failures = 10
  call-timeout = 10s
  reset-timeout = 30s
}

On Thu, Aug 10, 2017 at 10:17 PM, Muthukumaran K <muthukumara...@ericsson.com> wrote:
Hi Tom, Srini,
We have also noticed this with Boron, very sporadically, even without taking any explicit action on the shard like Srini did.
Srini, are you referring to “journal-plugin-fallback” from http://doc.akka.io/docs/akka/current/scala/general/configuration.html#config-akka-persistence ?

Regards
Muthu

From: controller-dev-boun...@lists.opendaylight.org [mailto:controller-dev-boun...@lists.opendaylight.org] On Behalf Of Srini Seetharaman
Sent: Friday, August 11, 2017 9:40 AM
To: Tom Pantelis
Cc: controller-dev@lists.opendaylight.org
Subject: Re: [controller-dev] Circuit Breaker timed out

Thanks Tom. I will investigate further why the local disk operation failed. It seems strange, though, because I haven't seen anything in dmesg. The default value for the call-timeout is 10s in akka.conf.

On Thu, Aug 10, 2017 at 3:20 PM, Tom Pantelis <tompante...@gmail.com> wrote:
That error is from Akka persistence. It happens if the backend persistence plugin doesn't respond in time. I've only seen this in a CSIT environment whose disk I/O was overloaded. The timeouts can be tweaked; I don't recall exactly what they are, but you can find them in the Akka docs (their names contain circuit-breaker).

On Thu, Aug 10, 2017 at 6:01 PM, Srini Seetharaman <srini.seethara...@gmail.com> wrote:
Hi Tom,
In our ODL deployment, which runs in standalone mode with operational store persistence enabled, we saw the following error being printed.
Once the member-1-shard-default-operational shard is shut down, all subsequent write transactions fail and the system becomes unstable. At this point, we were probably doing fewer than 10 transactions per second. Any idea what is causing this? Has anyone seen this before?

2017-08-07 19:15:59,622 | ERROR | lt-dispatcher-23 | Shard | 176 - com.typesafe.akka.slf4j - 2.4.7 | Failed to persist event type [org.opendaylight.controller.cluster.raft.ReplicatedLogImplEntry] with sequence number [9897493] for persistenceId [member-1-shard-default-operational].
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
2017-08-07 19:15:59,628 | INFO | lt-dispatcher-24 | Shard | 188 - org.opendaylight.controller.sal-akka-raft - 1.4.2.Boron-SR2 | Stopping Shard member-1-shard-default-operational
2017-08-07 19:15:59,629 | ERROR | lt-dispatcher-23 | LocalThreePhaseCommitCohort | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Failed to prepare transaction member-1-datastore-operational-fe-5-txn-791019 on backend
java.lang.RuntimeException: Transaction aborted due to shutdown.
	at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
	at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
	at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
	at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
	at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scala:223)[181:com.typesafe.akka.persistence:2.4.7]
	at akka.persistence.UntypedPersistentActor.aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
	at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.dungeon.FaultHandling$class.handleChildTerminated(FaultHandling.scala:293)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.dungeon.DeathWatch$class.watchedActorTerminated(DeathWatch.scala:61)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.ActorCell.watchedActorTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:460)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)[175:com.typesafe.akka.actor:2.4.7]
	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)[175:com.typesafe.akka.actor:2.4.7]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)[175:com.typesafe.akka.actor:2.4.7]
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)[175:com.typesafe.akka.actor:2.4.7]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[175:com.typesafe.akka.actor:2.4.7]
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
2017-08-07 19:15:59,629 | WARN | ult-dispatcher-3 | ConcurrentDOMDataBroker | 193 - org.opendaylight.controller.sal-distributed-datastore - 1.4.2.Boron-SR2 | Tx: DOM-956840 Error during phase CAN_COMMIT, starting Abort
java.lang.RuntimeException: Transaction aborted due to shutdown.
	at org.opendaylight.controller.cluster.datastore.ShardCommitCoordinator.abortPendingTransactions(ShardCommitCoordinator.java:399)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
	at org.opendaylight.controller.cluster.datastore.Shard.postStop(Shard.java:211)[193:org.opendaylight.controller.sal-distributed-datastore:1.4.2.Boron-SR2]
	at akka.actor.Actor$class.aroundPostStop(Actor.scala:494)[175:com.typesafe.akka.actor:2.4.7]
	at akka.persistence.UntypedPersistentActor.akka$persistence$Eventsourced$$super$aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
	at akka.persistence.Eventsourced$class.aroundPostStop(Eventsourced.scala:223)[181:com.typesafe.akka.persistence:2.4.7]
	at akka.persistence.UntypedPersistentActor.aroundPostStop(PersistentActor.scala:168)[181:com.typesafe.akka.persistence:2.4.7]
	at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.dungeon.FaultHandling$class.handleChildTerminated(FaultHandling.scala:293)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.ActorCell.handleChildTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.dungeon.DeathWatch$class.watchedActorTerminated(DeathWatch.scala:61)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.ActorCell.watchedActorTerminated(ActorCell.scala:374)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:460)[175:com.typesafe.akka.actor:2.4.7]
	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:483)[175:com.typesafe.akka.actor:2.4.7]
	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:282)[175:com.typesafe.akka.actor:2.4.7]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:260)[175:com.typesafe.akka.actor:2.4.7]
	at akka.dispatch.Mailbox.run(Mailbox.scala:224)[175:com.typesafe.akka.actor:2.4.7]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)[175:com.typesafe.akka.actor:2.4.7]
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)[171:org.scala-lang.scala-library:2.11.8.v20160304-115712-1706a37eb8]
2017-08-07 19:15:59,630 | INFO | lt-dispatcher-17 | LocalActorRef | 176 - com.typesafe.akka.slf4j - 2.4.7 | Message [org.opendaylight.controller.cluster.raft.client.messages.GetOnDemandRaftState] from Actor[akka://opendaylight-cluster-data/temp/$b] to Actor[akka://opendaylight-cluster-data/user/shardmanager-operational/member-1-shard-default-operational#-376322108] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
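To make Tom's explanation concrete, here is a simplified, hypothetical Java model of how the three circuit-breaker settings interact and how a single timed-out write can take the shard down. Each journal write goes through a breaker: a write that outlasts call-timeout fails with a timeout, max-failures consecutive failures open the breaker, and after reset-timeout one trial call is let through. The ToyShard part models the sequence visible in the log above (the persist failure is followed by the shard stopping, and Shard.postStop aborting pending transactions). Breaker, ToyShard and all method names are invented for illustration; this is not Akka's CircuitBreaker or ODL's Shard, and the real semantics may differ in detail.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class PersistBreakerSketch {

    /** Hypothetical, simplified breaker with the three akka.conf knobs; not Akka's CircuitBreaker. */
    static final class Breaker {
        enum State { CLOSED, OPEN, HALF_OPEN }
        private final int maxFailures;
        private final Duration callTimeout, resetTimeout;
        private State state = State.CLOSED;
        private int failures;
        private long openedAt; // System.nanoTime() when the breaker last tripped

        Breaker(int maxFailures, Duration callTimeout, Duration resetTimeout) {
            this.maxFailures = maxFailures;
            this.callTimeout = callTimeout;
            this.resetTimeout = resetTimeout;
        }

        synchronized State state() { return state; }

        <T> CompletableFuture<T> call(Supplier<CompletableFuture<T>> body) {
            synchronized (this) {
                if (state == State.OPEN) {
                    if (System.nanoTime() - openedAt < resetTimeout.toNanos()) {
                        // While open, fail fast without touching the backend.
                        return CompletableFuture.failedFuture(new IllegalStateException("Circuit Breaker is open"));
                    }
                    state = State.HALF_OPEN; // reset-timeout elapsed: let a trial call through
                }
            }
            CompletableFuture<T> result = new CompletableFuture<>();
            body.get().orTimeout(callTimeout.toMillis(), TimeUnit.MILLISECONDS).whenComplete((value, err) -> {
                Throwable cause = (err instanceof CompletionException && err.getCause() != null) ? err.getCause() : err;
                recordOutcome(cause == null);
                if (cause == null) {
                    result.complete(value);
                } else {
                    // A backend slower than call-timeout is reported as a breaker timeout.
                    result.completeExceptionally(cause instanceof TimeoutException
                            ? new TimeoutException("Circuit Breaker Timed out.") : cause);
                }
            });
            return result;
        }

        private synchronized void recordOutcome(boolean success) {
            if (success) {
                failures = 0;
                state = State.CLOSED;
            } else if (state == State.HALF_OPEN || ++failures >= maxFailures) {
                state = State.OPEN; // trip: calls fail fast until reset-timeout elapses
                openedAt = System.nanoTime();
            }
        }
    }

    /** Hypothetical shard: a failed persist stops it, and stopping aborts pending transactions. */
    static final class ToyShard {
        private final Breaker breaker;
        private final List<CompletableFuture<Void>> pending = new ArrayList<>();
        private boolean stopped;

        ToyShard(Breaker breaker) { this.breaker = breaker; }

        synchronized CompletableFuture<Void> submitTransaction() {
            CompletableFuture<Void> tx = new CompletableFuture<>();
            pending.add(tx);
            return tx;
        }

        CompletableFuture<Void> persist(Supplier<CompletableFuture<Void>> journalWrite) {
            // Any persist failure, including a breaker timeout, stops the shard.
            return breaker.call(journalWrite).whenComplete((v, err) -> { if (err != null) stop(); });
        }

        synchronized void stop() {
            stopped = true; // stopping aborts whatever transactions were still pending
            pending.forEach(tx -> tx.completeExceptionally(new RuntimeException("Transaction aborted due to shutdown.")));
            pending.clear();
        }

        synchronized boolean isStopped() { return stopped; }
    }

    public static void main(String[] args) {
        // call-timeout scaled down from the 10s default so the demo finishes quickly.
        ToyShard shard = new ToyShard(new Breaker(10, Duration.ofMillis(200), Duration.ofSeconds(30)));
        CompletableFuture<Void> tx = shard.submitTransaction();
        // A journal write that never completes stands in for a stalled disk.
        Throwable persistErr = shard.persist(() -> new CompletableFuture<Void>()).handle((v, e) -> e).join();
        Throwable txErr = tx.handle((v, e) -> e).join();
        System.out.println("persist failed: " + persistErr.getCause().getMessage());
        System.out.println("shard stopped: " + shard.isStopped());
        System.out.println("pending tx failed: " + txErr.getMessage());
    }
}
```

Running main should report the "Circuit Breaker Timed out." failure, a stopped shard, and the pending transaction failing with "Transaction aborted due to shutdown.", in the same order as the log. In this model, raising call-timeout only gives a slow write more time; it does not make the write itself faster.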
_______________________________________________ controller-dev mailing list controller-dev@lists.opendaylight.org https://lists.opendaylight.org/mailman/listinfo/controller-dev