Re: [akka-user] "Death watch quarantine" or "How to stress test appropriately"

Patrik Nordwall Thu, 30 Jan 2014 00:08:42 -0800

Hi Josh,


On Thu, Jan 30, 2014 at 5:46 AM, Joshua Ball <[email protected]> wrote:

> Hi,
>
> I encountered some surprising behavior while stress testing my
> application, and boiled it down to the following code:
>
> import org.junit.Test
> import akka.actor._
> import akka.actor.Identify
> import java.util.concurrent.atomic.AtomicInteger
> import com.typesafe.config.{Config, ConfigFactory}
> import java.io.StringReader
>
> class ThousandActorsWatchingEachOther {
>
>   val countUp = new AtomicInteger(0)
>
>   @Test
>   def testThousand() {
>     val actorSystem1: ActorSystem = ActorSystem.create("system",
> getLocalHostRemotingConfig(2552))
>     val count = 1000
>     for (i <- 0 until count) {
>       actorSystem1.actorOf(Props(new Echoer), s"$i")
>     }
>     Thread.sleep(5000l) // give them time to be started
>
>     val actorSystem2: ActorSystem = ActorSystem.create("system",
> getLocalHostRemotingConfig(2553))
>     for (i <- 0 until count) {
>       actorSystem2.actorOf(Props(new Basher(i)), s"basher-$i")
>     }
>
>     Thread.sleep(360000l)
>   }
>
>   class Echoer extends Actor {
>     override def receive = {
>       case x => sender.tell(x, self)
>     }
>   }
>
>   class Basher(target: Int) extends Actor {
>     override def preStart() {
>       context.actorSelection(s"akka.tcp://system@localhost:2552/user/$target")
> ! Identify()
>     }
>
>     override def receive = {
>       case ActorIdentity(_, None) =>
>         println("Actor not found")
>       case ActorIdentity(_, Some(x)) =>
>         context.watch(x)
>         println(countUp.incrementAndGet())
>       case Terminated(x) =>
>         println("Terminated!")
>     }
>   }
>
>   def getLocalHostRemotingConfig(port: Int): Config = {
>     ConfigFactory.parseReader(new StringReader(
>       """
>         |akka {
>         |  actor {
>         |    provider = "akka.remote.RemoteActorRefProvider"
>         |  }
>         |  remote {
>         |    enabled-transports = ["akka.remote.netty.tcp"]
>         |    netty.tcp {
>         |      hostname = "localhost"
>         |      port = """.stripMargin + port + """
>         |    }
>         | }
>         |}
>       """.stripMargin))
>   }
> }
>
> Also available here:
> https://gist.github.com/sciolizer/8702641/2bc3bb4589b1aad66df51028a638c0bc975f53b5
>
> The test starts up two separate actor systems supporting tcp remoting and
> listening on different ports. The system on port 2552 spawns a thousand
> echoer actors, which just echo back any messages they receive, although
> this is not important for the test. The system on port 2553 spawns a
> thousand actors which 1) locate their corresponding echoer using
> actorSelection and 2) deathwatch the echoer actor once they've found it.
>
> The expected behavior is that the numbers 1 to 1000 are printed out
> (corresponding to the number of deathwatches created) and then the test
> sleeps for an hour.
>
> What actually happens is that somewhere between the 500th and 600th death
> watch, I get the following error:
>
> [WARN] [01/29/2014 20:18:16.934] [system-akka.actor.default-dispatcher-16]
> [Remoting] Association to [akka.tcp://system@localhost:2552] having UID
> [1419818278] is irrecoverably failed. UID is now quarantined and all
> messages to this UID will be delivered to dead letters. Remote actor system
> must be restarted to recover from this situation.
>
> I've found three solutions to make this error go away:
>
> 1) Remove the context.watch(x) line
> 2) Set akka.remote.system-message-buffer-size to a large value, such as
> 10000
> 3) Ramp up the traffic slowly instead of pounding the system immediately.
>
>
The reason for triggering quarantine is that the resend buffer for system
messages is filled up, because of the many watch at the same time.



> It seems solutions (1) and (2) are roughly the same, since death watches
> are managed using system messages. I don't understand why the 3rd solution
> works, though.
>

No, 1 and 2 are very different. 3 works because the early watch messages
are delivered and acknowledged before sending the later, and therefore the
buffer is not filled up.


> My implementation of the 3rd solution can be found here:
> https://gist.github.com/sciolizer/8702641/revisions (revision 2)
>
> I warmed the system up by gradually decreasing the sleep time between
> actor creations from 10 milliseconds to 0 milliseconds, for a thousand
> actor creations, and then let it run at full speed for another thousand
> actors (2000 total). So for the second half of its test, the 3rd solution
> is no different from the original problem.
>
> So my two questions are:
>
> Why does solution (3) work? It seems to me that the system-message-buffer
> should still overflow after the 1500th actor or so is created.
>
> and
>
> Is my stress test fair? akka.remote.system-message-buffer-size's default
> value of 1000 seems very low. What are the tradeoffs in increasing it?
>

The tradeoff is increased memory usage, but setting it to 10000 shouldn't
be a problem.

Cheers,
Patrik


>
> Thanks,
> Josh "Ua" Ball
>
> --
> >>>>>>>>>> Read the docs: http://akka.io/docs/
> >>>>>>>>>> Check the FAQ: http://akka.io/faq/
> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
> ---
> You received this message because you are subscribed to the Google Groups
> "Akka User List" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/akka-user.
> For more options, visit https://groups.google.com/groups/opt_out.
>



-- 

Patrik Nordwall
Typesafe <http://typesafe.com/> -  Reactive apps on the JVM
Twitter: @patriknw

-- 
>>>>>>>>>>      Read the docs: http://akka.io/docs/
>>>>>>>>>>      Check the FAQ: http://akka.io/faq/
>>>>>>>>>>      Search the archives: https://groups.google.com/group/akka-user
--- 
You received this message because you are subscribed to the Google Groups "Akka 
User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/groups/opt_out.

Re: [akka-user] "Death watch quarantine" or "How to stress test appropriately"

Reply via email to