Hi Josh,
On Thu, Jan 30, 2014 at 5:46 AM, Joshua Ball <[email protected]> wrote: > Hi, > > I encountered some surprising behavior while stress testing my > application, and boiled it down to the following code: > > import org.junit.Test > import akka.actor._ > import akka.actor.Identify > import java.util.concurrent.atomic.AtomicInteger > import com.typesafe.config.{Config, ConfigFactory} > import java.io.StringReader > > class ThousandActorsWatchingEachOther { > > val countUp = new AtomicInteger(0) > > @Test > def testThousand() { > val actorSystem1: ActorSystem = ActorSystem.create("system", > getLocalHostRemotingConfig(2552)) > val count = 1000 > for (i <- 0 until count) { > actorSystem1.actorOf(Props(new Echoer), s"$i") > } > Thread.sleep(5000l) // give them time to be started > > val actorSystem2: ActorSystem = ActorSystem.create("system", > getLocalHostRemotingConfig(2553)) > for (i <- 0 until count) { > actorSystem2.actorOf(Props(new Basher(i)), s"basher-$i") > } > > Thread.sleep(360000l) > } > > class Echoer extends Actor { > override def receive = { > case x => sender.tell(x, self) > } > } > > class Basher(target: Int) extends Actor { > override def preStart() { > context.actorSelection(s"akka.tcp://system@localhost:2552/user/$target") > ! Identify() > } > > override def receive = { > case ActorIdentity(_, None) => > println("Actor not found") > case ActorIdentity(_, Some(x)) => > context.watch(x) > println(countUp.incrementAndGet()) > case Terminated(x) => > println("Terminated!") > } > } > > def getLocalHostRemotingConfig(port: Int): Config = { > ConfigFactory.parseReader(new StringReader( > """ > |akka { > | actor { > | provider = "akka.remote.RemoteActorRefProvider" > | } > | remote { > | enabled-transports = ["akka.remote.netty.tcp"] > | netty.tcp { > | hostname = "localhost" > | port = """.stripMargin + port + """ > | } > | } > |} > """.stripMargin)) > } > } > > Also available here: > https://gist.github.com/sciolizer/8702641/2bc3bb4589b1aad66df51028a638c0bc975f53b5 > > The test starts up two separate actor systems supporting tcp remoting and > listening on different ports. The system on port 2552 spawns a thousand > echoer actors, which just echo back any messages they receive, although > this is not important for the test. The system on port 2553 spawns a > thousand actors which 1) locate their corresponding echoer using > actorSelection and 2) deathwatch the echoer actor once they've found it. > > The expected behavior is that the numbers 1 to 1000 are printed out > (corresponding to the number of deathwatches created) and then the test > sleeps for an hour. > > What actually happens is that somewhere between the 500th and 600th death > watch, I get the following error: > > [WARN] [01/29/2014 20:18:16.934] [system-akka.actor.default-dispatcher-16] > [Remoting] Association to [akka.tcp://system@localhost:2552] having UID > [1419818278] is irrecoverably failed. UID is now quarantined and all > messages to this UID will be delivered to dead letters. Remote actor system > must be restarted to recover from this situation. > > I've found three solutions to make this error go away: > > 1) Remove the context.watch(x) line > 2) Set akka.remote.system-message-buffer-size to a large value, such as > 10000 > 3) Ramp up the traffic slowly instead of pounding the system immediately. > > The reason for triggering quarantine is that the resend buffer for system messages is filled up, because of the many watch at the same time. > It seems solutions (1) and (2) are roughly the same, since death watches > are managed using system messages. I don't understand why the 3rd solution > works, though. > No, 1 and 2 are very different. 3 works because the early watch messages are delivered and acknowledged before sending the later, and therefore the buffer is not filled up. > My implementation of the 3rd solution can be found here: > https://gist.github.com/sciolizer/8702641/revisions (revision 2) > > I warmed the system up by gradually decreasing the sleep time between > actor creations from 10 milliseconds to 0 milliseconds, for a thousand > actor creations, and then let it run at full speed for another thousand > actors (2000 total). So for the second half of its test, the 3rd solution > is no different from the original problem. > > So my two questions are: > > Why does solution (3) work? It seems to me that the system-message-buffer > should still overflow after the 1500th actor or so is created. > > and > > Is my stress test fair? akka.remote.system-message-buffer-size's default > value of 1000 seems very low. What are the tradeoffs in increasing it? > The tradeoff is increased memory usage, but setting it to 10000 shouldn't be a problem. Cheers, Patrik > > Thanks, > Josh "Ua" Ball > > -- > >>>>>>>>>> Read the docs: http://akka.io/docs/ > >>>>>>>>>> Check the FAQ: http://akka.io/faq/ > >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user > --- > You received this message because you are subscribed to the Google Groups > "Akka User List" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > To post to this group, send email to [email protected]. > Visit this group at http://groups.google.com/group/akka-user. > For more options, visit https://groups.google.com/groups/opt_out. > -- Patrik Nordwall Typesafe <http://typesafe.com/> - Reactive apps on the JVM Twitter: @patriknw -- >>>>>>>>>> Read the docs: http://akka.io/docs/ >>>>>>>>>> Check the FAQ: http://akka.io/faq/ >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user --- You received this message because you are subscribed to the Google Groups "Akka User List" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at http://groups.google.com/group/akka-user. For more options, visit https://groups.google.com/groups/opt_out.
