Hi Robert,


On Mon, Mar 30, 2015 at 7:13 PM, Robert Metzger <[email protected]> wrote:

> Thanks for the quick reply.
>
> This is the setting:
>
> watch-failure-detector{
>    heartbeat-interval = 10 s
>    acceptable-heartbeat-pause = 100 s
>
The above setting means that you accept 100 s of silence from a remote
host, so no DeathWatch events will be generated earlier than 100
seconds. This explains why you get late notifications. The pre-2.3.9
behaviour "worked" only by accident; now it really works and respects
this setting properly.
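
To make the effect of acceptable-heartbeat-pause concrete, here is a rough
sketch of a phi accrual check. It is modelled on the logistic approximation
commonly used by phi accrual failure detectors and is not a copy of Akka's
internal class. The names PhiSketch, phi and secondsUntilSuspected, and the
2.5 s standard deviation used below, are assumptions for illustration only.

```scala
// Sketch of a phi accrual check; all names here are illustrative, not Akka API.
object PhiSketch {

  /** Suspicion level after `sinceLastMs` milliseconds without a heartbeat. */
  def phi(sinceLastMs: Double, meanMs: Double, stdDevMs: Double, pauseMs: Double): Double = {
    // The acceptable pause is added on top of the expected heartbeat gap,
    // so silence shorter than (mean + pause) is barely suspicious.
    val shiftedMean = meanMs + pauseMs
    val y = (sinceLastMs - shiftedMean) / stdDevMs
    // Logistic approximation of the normal distribution's tail probability.
    val e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    if (sinceLastMs > shiftedMean) -math.log10(e / (1.0 + e))
    else -math.log10(1.0 - 1.0 / (1.0 + e))
  }

  /** First whole second of silence at which phi reaches the threshold. */
  def secondsUntilSuspected(meanMs: Double, stdDevMs: Double, pauseMs: Double, threshold: Double): Int =
    Iterator.from(1).find(s => phi(s * 1000.0, meanMs, stdDevMs, pauseMs) >= threshold).get
}
```

Calling PhiSketch.secondsUntilSuspected(10000.0, 2500.0, 100000.0, 12.0)
with the values from your config (10 s interval, 100 s pause, threshold 12)
gives a result above 100 seconds, while the same call with a 10 s pause
gives a much smaller one. So lowering acceptable-heartbeat-pause is the knob
to turn if you want earlier Terminated messages.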

-Endre



>    threshold = 12
> }
>
>
> On Mon, Mar 30, 2015 at 5:03 PM, Endre Varga <[email protected]>
> wrote:
>
>> Hi Robert
>>
>> What is your watch failure detector setting? Detection speed depends on
>> those. There was a bug in earlier remoting that published internal
>> AddressTerminated messages when it was not supposed to (remoting does not
>> consider unreachable machines as dead, that decision is taken by remote
>> DeathWatch or clustering).
>>
>> -Endre
>>
>> On Mon, Mar 30, 2015 at 4:51 PM, Robert Metzger <[email protected]>
>> wrote:
>>
>>> A quick follow-up question: I've upgraded Akka from 2.3.7 to 2.3.9. I've
>>> noticed that failed remote machines are detected much later in 2.3.9 than
>>> in 2.3.7. Akka detected failed machines in less than 5 seconds with 2.3.7.
>>> With 2.3.9 it took much longer, almost 2 minutes in the example below.
>>> I haven't investigated the issue more closely. Maybe this is also caused
>>> by our system.
>>>
>>> Did anything with respect to failure detection change between the two
>>> releases?
>>>
>>>
>>> When using Flink on YARN, there are actually two systems monitoring the
>>> JVMs: Akka and YARN. From the timestamps in the log, one can easily see
>>> that the time until a failed JVM is detected is much longer with 2.3.9.
>>>
>>> With Akka 2.3.9:
>>> *15:07:56,922 *WARN  akka.remote.ReliableDeliverySupervisor
>>>            - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Disassociated].
>>> *yarn detects failure --> 15:07:58,130* INFO
>>>  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Container
>>> container_1426807853451_0035_01_000002 is completed with diagnostics:
>>> Container killed on request. Exit code is 143
>>> Container exited with a non-zero exit code 143
>>> Killed by external signal
>>> *15:08:04,163* WARN  akka.remote.ReliableDeliverySupervisor
>>>            - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Association failed with [akka.tcp://[email protected]:39280
>>> ]].
>>> *15:08:14,143* WARN  akka.remote.ReliableDeliverySupervisor
>>>            - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Association failed with [akka.tcp://[email protected]:39280
>>> ]].
>>> 15:08:24,149 WARN  akka.remote.ReliableDeliverySupervisor
>>>          - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Association failed with [akka.tcp://[email protected]:39280
>>> ]].
>>> 15:08:34,138 WARN  akka.remote.ReliableDeliverySupervisor
>>>          - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Association failed with [akka.tcp://[email protected]:39280
>>> ]].
>>> 15:08:44,154 WARN  akka.remote.ReliableDeliverySupervisor
>>>          - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Association failed with [akka.tcp://[email protected]:39280
>>> ]].
>>> 15:08:54,158 WARN  akka.remote.ReliableDeliverySupervisor
>>>          - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Association failed with [akka.tcp://[email protected]:39280
>>> ]].
>>> 15:09:04,146 WARN  akka.remote.ReliableDeliverySupervisor
>>>          - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Association failed with [akka.tcp://[email protected]:39280
>>> ]].
>>> 15:09:14,150 WARN  akka.remote.ReliableDeliverySupervisor
>>>          - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Association failed with [akka.tcp://[email protected]:39280
>>> ]].
>>> ....
>>> *15:09:44,165* WARN  akka.remote.ReliableDeliverySupervisor
>>>            - Association with remote system [akka.tcp://
>>> [email protected]:39280] has failed, address is now gated for [5000]
>>> ms. Reason is: [Association failed with [akka.tcp://[email protected]:39280
>>> ]].
>>> 15:09:54,144 WARN  akka.remote.RemoteWatcher
>>>         - Detected unreachable: [akka.tcp://[email protected]:39280]
>>> *akka sends Terminated --> 15:09:54,154* INFO
>>>  org.apache.flink.runtime.instance.InstanceManager             -
>>> Unregistered task manager akka.tcp://[email protected]:39280. Number
>>> of registered task managers 15. Number of available slots 15
>>>
>>> Difference between YARN and Akka: about 2 minutes (1 min 56 s).
>>>
>>>
>>> With Akka 2.3.7:
>>>
>>> 16:47:23,859 WARN  akka.remote.ReliableDeliverySupervisor
>>>          - Association with remote system [akka.tcp://
>>> [email protected]:45854] has failed, address is now gated for [5000]
>>> ms. Reason is: [Disassociated].
>>> *16:47:26,337* INFO
>>>  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - Container
>>> container_1426807853451_0038_01_000005 is completed with diagnostics:
>>> Container killed on request. Exit code is 143
>>> Container exited with a non-zero exit code 143
>>> Killed by external signal
>>>
>>> 16:47:37,786 WARN  Remoting
>>>          - Tried to associate with unreachable remote address [akka.tcp://
>>> [email protected]:45854]. Address is now gated for 5000 ms, all
>>> messages to this address will be delivered to dead letters. Reason:
>>> Connection refused: /130.149.21.2:45854
>>> *16:47:37,795* INFO  org.apache.flink.runtime.instance.InstanceManager
>>>             - Unregistered task manager akka.tcp://
>>> [email protected]:45854. Number of registered task managers 10. Number
>>> of available slots 10.
>>>
>>> Difference between YARN and Akka: 11 seconds.
>>>
>>> On Tue, Mar 24, 2015 at 9:01 PM, Robert Metzger <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>> We are currently using Akka 2.3.7. I'm going to set the version to
>>>> 2.3.9.
>>>>
>>>> I think I've found a way to work around the issue:
>>>> It seems that Netty uses "java.util.logging" by default, while our
>>>> project uses slf4j. Therefore, I added the following code to make Netty
>>>> use slf4j:
>>>>
>>>> import org.jboss.netty.logging.{InternalLoggerFactory, Slf4JLoggerFactory}
>>>> InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory)
>>>>
>>>> After that, I added the following entry to our log4j properties:
>>>>
>>>> log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, file
>>>>
>>>> With these two changes, our users won't see the exceptions anymore
>>>> during shutdown. My initial tests indicate that this resolved the issue.
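
For setups that have no slf4j binding, a rough alternative sketch (not the
approach described above): as long as Netty 3 is still logging through
java.util.logging, the same logger can be raised to SEVERE directly so the
shutdown WARNING is dropped. The object name NettyShutdownNoise is made up
for illustration; the logger name is taken from the WARNING header of the
stack trace quoted further down in this thread.

```scala
import java.util.logging.{Level, Logger}

// Hypothetical helper; only java.util.logging calls are used.
object NettyShutdownNoise {
  // Logger name as printed in the WARNING header of the quoted stack trace.
  val PipelineLoggerName = "org.jboss.netty.channel.DefaultChannelPipeline"

  // java.util.logging holds loggers weakly, so keep a strong reference here;
  // otherwise the configured level could be lost if the logger is collected.
  private val pipelineLogger: Logger = Logger.getLogger(PipelineLoggerName)

  /** Raise the threshold so WARNING records from this logger are discarded. */
  def silence(): Unit = pipelineLogger.setLevel(Level.SEVERE)
}
```

Call NettyShutdownNoise.silence() once before the ActorSystem is shut down.
This hides every WARNING from that one logger, not only the shutdown
exception, and it has no effect once Netty has been switched to another
logger factory.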
>>>>
>>>> If the issue persists with this workaround and on akka 2.3.9, I'll file
>>>> a bug.
>>>>
>>>> On Mon, Mar 23, 2015 at 9:26 AM, Patrik Nordwall <
>>>> [email protected]> wrote:
>>>>
>>>>> Please open a new issue at github
>>>>> <https://github.com/akka/akka/issues> if this can be reproduced with
>>>>> Akka 2.3.9.
>>>>> Thanks,
>>>>> Patrik
>>>>>
>>>>> On Mon, Mar 16, 2015 at 9:43 PM, Robert Metzger <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> sorry for reviving this very old email thread.
>>>>>> Are there any updates or workarounds for this issue?
>>>>>> The bug in the tracker has been marked as "not reproducible", but
>>>>>> we're seeing the error quite often (I would guess in 10% of our automated
>>>>>> tests).
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tuesday, November 26, 2013 at 8:07:02 AM UTC+1, Björn Antonsson
>>>>>> wrote:
>>>>>>
>>>>>>>  Hi Greg,
>>>>>>>
>>>>>>> On Tuesday, 26 November 2013 at 03:06, tigerfoot wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I've got two Akka ActorSystems (using Cluster for test cases).
>>>>>>> Config below.  I know--cluster is way overkill here, but I'm testing the
>>>>>>> mechanism for a much larger use.
>>>>>>>
>>>>>>> My test looks like this:
>>>>>>>
>>>>>>> val s1 = Server1()  // initiates an ActorSystem inside the Server1
>>>>>>> class
>>>>>>>
>>>>>>> // Establish a second ActorSystem to simulate another remote client
>>>>>>> on the network sending messages to the first.
>>>>>>> val host = s1.context.myHostname  // my local IP address (not
>>>>>>> localhost)
>>>>>>> val sys2 = ActorSystem("foobar", 
>>>>>>> ConfigFactory.parseString(ServerConfigs.svr2Cfg)
>>>>>>> )
>>>>>>> val selection = sys2.actorSelection( s"""akka.tcp://MyCluster@
>>>>>>> $host:9021/user/awe.server.ServerModule""" )
>>>>>>> selection ! ServerEvent( "acme", "hello", None )
>>>>>>>
>>>>>>> // Simulate doing work
>>>>>>> Thread.sleep(500)
>>>>>>>
>>>>>>> // Shut things down cleanly
>>>>>>> s1.shutdown
>>>>>>> s1.system.awaitTermination
>>>>>>> sys2.shutdown
>>>>>>> sys2.awaitTermination
>>>>>>> assert( s1.system.isTerminated )
>>>>>>> assert( sys2.isTerminated )
>>>>>>>
>>>>>>> Now this functions just fine, but I'm getting the noisy exception at
>>>>>>> the bottom of my post.  Any idea what's causing that?
>>>>>>> I saw this post here and wondered if this is the same thing:
>>>>>>> https://www.assembla.com/spaces/akka/tickets/3096#/activity/ticket:
>>>>>>>
>>>>>>>
>>>>>>> Thanks for looking at the tickets. You are right that your exception
>>>>>>> log is the one in the ticket. It’s not an error in your application.
>>>>>>> It’s a known issue and it looks bad in the logs.
>>>>>>>
>>>>>>> B/
>>>>>>>
>>>>>>>
>>>>>>> Thanks for any ideas!
>>>>>>> Greg
>>>>>>>
>>>>>>> Config:
>>>>>>> val svr1Cfg = s"""
>>>>>>> akka {
>>>>>>> loglevel = "ERROR"
>>>>>>> stdout-loglevel = "ERROR"
>>>>>>> loggers = ["akka.event.slf4j.Slf4jLogger"]
>>>>>>> actor {
>>>>>>> provider = "akka.cluster.ClusterActorRefProvider"
>>>>>>> }
>>>>>>> remote {
>>>>>>> enabled-transports = ["akka.remote.netty.tcp"]
>>>>>>> netty.tcp {
>>>>>>> port = 9021
>>>>>>> }
>>>>>>> }
>>>>>>> cluster {
>>>>>>> seed-nodes = [ "akka.tcp://MyCluster@$myHost:9021" ]
>>>>>>> auto-down = on
>>>>>>> log-info = off
>>>>>>> }
>>>>>>> }
>>>>>>> """
>>>>>>>
>>>>>>> val svr2Cfg = s"""
>>>>>>> akka {
>>>>>>> loglevel = "ERROR"
>>>>>>> stdout-loglevel = "ERROR"
>>>>>>> loggers = ["akka.event.slf4j.Slf4jLogger"]
>>>>>>> actor {
>>>>>>> provider = "akka.cluster.ClusterActorRefProvider"
>>>>>>> }
>>>>>>> remote {
>>>>>>> enabled-transports = ["akka.remote.netty.tcp"]
>>>>>>> netty.tcp {
>>>>>>> port = 9022
>>>>>>> }
>>>>>>> }
>>>>>>> cluster {
>>>>>>> seed-nodes = [ "akka.tcp://MyCluster@$myHost:9021" ]
>>>>>>> auto-down = on
>>>>>>> log-info = off
>>>>>>> }
>>>>>>> }
>>>>>>> """
>>>>>>>
>>>>>>> Nov 25, 2013 8:00:59 PM org.jboss.netty.channel.
>>>>>>> DefaultChannelPipeline
>>>>>>> WARNING: An exception was thrown by an exception handler.
>>>>>>> java.util.concurrent.RejectedExecutionException: Worker has already
>>>>>>> been shutdown
>>>>>>> at org.jboss.netty.channel.socket.nio.AbstractNioSelector.
>>>>>>> registerTask(AbstractNioSelector.java:115)
>>>>>>> at org.jboss.netty.channel.socket.nio.AbstractNioWorker.
>>>>>>> executeInIoThread(AbstractNioWorker.java:73)
>>>>>>> at org.jboss.netty.channel.socket.nio.NioWorker.
>>>>>>> executeInIoThread(NioWorker.java:36)
>>>>>>> at org.jboss.netty.channel.socket.nio.AbstractNioWorker.
>>>>>>> executeInIoThread(AbstractNioWorker.java:57)
>>>>>>> at org.jboss.netty.channel.socket.nio.NioWorker.
>>>>>>> executeInIoThread(NioWorker.java:36)
>>>>>>> at org.jboss.netty.channel.socket.nio.AbstractNioChannelSink.
>>>>>>> execute(AbstractNioChannelSink.java:34)
>>>>>>> at org.jboss.netty.channel.Channels.fireExceptionCaughtLater(
>>>>>>> Channels.java:496)
>>>>>>> at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(
>>>>>>> AbstractChannelSink.java:46)
>>>>>>> at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.
>>>>>>> handleDownstream(OneToOneEncoder.java:54)
>>>>>>> at org.jboss.netty.channel.Channels.disconnect(Channels.java:781)
>>>>>>> at org.jboss.netty.channel.AbstractChannel.disconnect(
>>>>>>> AbstractChannel.java:211)
>>>>>>> at akka.remote.transport.netty.NettyTransport$$anonfun$
>>>>>>> gracefulClose$1.apply(NettyTransport.scala:222)
>>>>>>> at akka.remote.transport.netty.NettyTransport$$anonfun$
>>>>>>> gracefulClose$1.apply(NettyTransport.scala:221)
>>>>>>> at scala.util.Success.foreach(Try.scala:205)
>>>>>>> at scala.concurrent.Future$$anonfun$foreach$1.apply(
>>>>>>> Future.scala:204)
>>>>>>> at scala.concurrent.Future$$anonfun$foreach$1.apply(
>>>>>>> Future.scala:204)
>>>>>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>>>>>>> at akka.dispatch.BatchingExecutor$Batch$$
>>>>>>> anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
>>>>>>> at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(
>>>>>>> BatchingExecutor.scala:82)
>>>>>>> at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(
>>>>>>> BatchingExecutor.scala:59)
>>>>>>> at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(
>>>>>>> BatchingExecutor.scala:59)
>>>>>>> at scala.concurrent.BlockContext$.withBlockContext(
>>>>>>> BlockContext.scala:72)
>>>>>>> at akka.dispatch.BatchingExecutor$Batch.run(
>>>>>>> BatchingExecutor.scala:58)
>>>>>>> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:42)
>>>>>>> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(
>>>>>>> AbstractDispatcher.scala:386)
>>>>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(
>>>>>>> ForkJoinTask.java:260)
>>>>>>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.
>>>>>>> runTask(ForkJoinPool.java:1339)
>>>>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(
>>>>>>> ForkJoinPool.java:1979)
>>>>>>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(
>>>>>>> ForkJoinWorkerThread.java:107)
>>>>>>>
>>>>>>> --
>>>>>>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>> >>>>>>>>>> Check the FAQ: http://akka.io/faq/
>>>>>>> >>>>>>>>>> Search the archives: https://groups.google.com/
>>>>>>> group/akka-user
>>>>>>> ---
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "Akka User List" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at http://groups.google.com/group/akka-user.
>>>>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Björn Antonsson
>>>>>>> Typesafe <http://typesafe.com/> – Reactive Apps on the JVM
>>>>>>> twitter: @bantonsson <http://twitter.com/#!/bantonsson>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Patrik Nordwall
>>>>> Typesafe <http://typesafe.com/> -  Reactive apps on the JVM
>>>>> Twitter: @patriknw
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>



-- 
Akka Team
Typesafe - Reactive apps on the JVM
Blog: letitcrash.com
Twitter: @akkateam
