[akka-user] "Death watch quarantine" or "How to stress test appropriately"

Joshua Ball Wed, 29 Jan 2014 23:20:09 -0800

Hi,

I encountered some surprising behavior while stress testing my application, 
and boiled it down to the following code:


import org.junit.Test
import akka.actor._
import akka.actor.Identify
import java.util.concurrent.atomic.AtomicInteger
import com.typesafe.config.{Config, ConfigFactory}
import java.io.StringReader

class ThousandActorsWatchingEachOther {

  val countUp = new AtomicInteger(0)

  @Test
  def testThousand() {
    val actorSystem1: ActorSystem = ActorSystem.create("system", getLocalHostRemotingConfig(2552))
    val count = 1000
    for (i <- 0 until count) {
      actorSystem1.actorOf(Props(new Echoer), s"$i")
    }
    Thread.sleep(5000L) // give them time to be started
    
    val actorSystem2: ActorSystem = ActorSystem.create("system", getLocalHostRemotingConfig(2553))
    for (i <- 0 until count) {
      actorSystem2.actorOf(Props(new Basher(i)), s"basher-$i")
    }

    Thread.sleep(360000L) // 360000 ms = 6 minutes
  }
  
  class Echoer extends Actor {
    override def receive = {
      case x => sender.tell(x, self)
    }
  }

  class Basher(target: Int) extends Actor {
    override def preStart() {
      
context.actorSelection(s"akka.tcp://system@localhost:2552/user/$target") ! 
Identify()
    }

    override def receive = {
      case ActorIdentity(_, None) =>
        println("Actor not found")
      case ActorIdentity(_, Some(x)) =>
        context.watch(x)
        println(countUp.incrementAndGet())
      case Terminated(x) =>
        println("Terminated!")
    }
  }

  def getLocalHostRemotingConfig(port: Int): Config = {
    ConfigFactory.parseReader(new StringReader(
      """
        |akka {
        |  actor {
        |    provider = "akka.remote.RemoteActorRefProvider"
        |  }
        |  remote {
        |    enabled-transports = ["akka.remote.netty.tcp"]
        |    netty.tcp {
        |      hostname = "localhost"
        |      port = """.stripMargin + port + """
        |    }
        | }
        |}
      """.stripMargin))
  }
}

Also available here: 
https://gist.github.com/sciolizer/8702641/2bc3bb4589b1aad66df51028a638c0bc975f53b5

The test starts up two separate actor systems supporting tcp remoting and 
listening on different ports. The system on port 2552 spawns a thousand 
echoer actors, which just echo back any messages they receive, although 
this is not important for the test. The system on port 2553 spawns a 
thousand actors which 1) locate their corresponding echoer using 
actorSelection and 2) deathwatch the echoer actor once they've found it.

The expected behavior is that the numbers 1 to 1000 are printed out 
(corresponding to the number of deathwatches created) and then the test 
sleeps for six minutes (the final Thread.sleep(360000) call).

What actually happens is that somewhere between the 500th and 600th death 
watch, I get the following error:

[WARN] [01/29/2014 20:18:16.934] [system-akka.actor.default-dispatcher-16] 
[Remoting] Association to [akka.tcp://system@localhost:2552] having UID 
[1419818278] is irrecoverably failed. UID is now quarantined and all 
messages to this UID will be delivered to dead letters. Remote actor system 
must be restarted to recover from this situation.

I've found three solutions to make this error go away:

1) Remove the context.watch(x) line
2) Set akka.remote.system-message-buffer-size to a large value, such as 
10000
3) Ramp up the traffic slowly instead of pounding the system immediately.
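
For reference, solution (2) can be sketched roughly as below. The helper 
name withSystemMessageBufferSize is hypothetical (it is not part of Akka or 
of the test above); it only layers one override on top of whatever config 
getLocalHostRemotingConfig returns, using the Typesafe Config library that 
the test already imports.

```scala
import com.typesafe.config.{Config, ConfigFactory}

object BufferConfig {
  // Hypothetical helper: overrides only akka.remote.system-message-buffer-size
  // and takes every other setting from `base`.
  def withSystemMessageBufferSize(base: Config, size: Int): Config =
    ConfigFactory
      .parseString(s"akka.remote.system-message-buffer-size = $size")
      .withFallback(base)
}
```

In the test it would presumably be used as 
ActorSystem.create("system", BufferConfig.withSystemMessageBufferSize(getLocalHostRemotingConfig(2553), 10000)).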

It seems solutions (1) and (2) are roughly the same, since death watches 
are managed using system messages. I don't understand why the 3rd solution 
works, though.

My implementation of the 3rd solution can be found here: 
https://gist.github.com/sciolizer/8702641/revisions (revision 2)

I warmed the system up by gradually decreasing the sleep time between actor 
creations from 10 milliseconds to 0 milliseconds over the first thousand 
actor creations, and then let it run at full speed for another thousand 
actors (2000 total). So for the second half of the test, the 3rd solution 
behaves no differently from the original failing test.
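
A minimal sketch of such a warm-up loop follows. It is not the code from 
the gist: the names RampUp, delayMillis and spawnWithRampUp are made up for 
illustration, and the linear decrease is an assumption (the description 
above only says the sleep time shrinks gradually from 10 ms to 0 ms).

```scala
object RampUp {
  // Assumption: the delay falls linearly from maxDelayMs to 0 over the
  // first rampCount creations and stays at 0 afterwards.
  def delayMillis(i: Int, rampCount: Int = 1000, maxDelayMs: Long = 10L): Long =
    if (i >= rampCount) 0L
    else maxDelayMs - (maxDelayMs * i) / rampCount

  // Calls spawn(i) for every i in 0 until total, sleeping for the ramped
  // delay before each call.
  def spawnWithRampUp(total: Int)(spawn: Int => Unit): Unit =
    for (i <- 0 until total) {
      val delay = delayMillis(i)
      if (delay > 0) Thread.sleep(delay)
      spawn(i)
    }
}
```

In the test, the function passed to spawnWithRampUp would be the 
actorSystem2.actorOf(...) call that creates each Basher.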

So my two questions are:

Why does solution (3) work? It seems to me that the system-message-buffer 
should still overflow after the 1500th actor or so is created.

and

Is my stress test fair? akka.remote.system-message-buffer-size's default 
value of 1000 seems very low. What are the tradeoffs in increasing it?

Thanks,
Josh "Ua" Ball
