On Wed, Feb 5, 2014 at 12:26 AM, Zoran Jeremic <[email protected]> wrote:
> Thanks for your advice. I tried to change the configuration based on your
> suggestions. My configuration looks like:
>
>> akka {
>>   actor {
>>     provider = "akka.cluster.ClusterActorRefProvider"
>>   }
>>   remote {
>>     log-remote-lifecycle-events = off
>>     netty.tcp {
>>       hostname = "127.0.0.1"
>>       port = 2551
>>     }
>>   }
>>   cluster {
>>     seed-nodes = [
>>       "akka.tcp://[email protected]:2551"
>>     ]
>>     gossip-interval = 200 ms
>>     leader-actions-interval = 200 ms
>>     unreachable-nodes-reaper-interval = 200 ms
>>
>
Why did you change these intervals?
>
>>     failure-detector {
>>       heartbeat-interval = 10 s
>>
>
Why? Use default interval.
>>       acceptable-heartbeat-pause = 10 s
>>       threshold = 10.0
>>     }
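For reference, if I remember correctly the 2.3 defaults for the cluster failure detector are along these lines, so there should normally be no need to set them at all:

```
akka.cluster.failure-detector {
  # default values from reference.conf (Akka 2.3, from memory)
  heartbeat-interval = 1 s
  threshold = 8.0
  acceptable-heartbeat-pause = 3 s
}
```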
>>
> akka.cluster.use-dispatcher = cluster-dispatcher
>>
>
note that the other properties are not prefixed with akka.cluster, so if this
line is nested inside the cluster section it might not be used, because it
would resolve to akka.cluster.akka.cluster.use-dispatcher
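To be safe, write the key either fully qualified at the top level of the file, or as a bare key inside the existing cluster section, e.g.:

```
# either, at the top level of the configuration file:
akka.cluster.use-dispatcher = cluster-dispatcher

# or, as a bare key inside the akka { cluster { ... } } sections:
use-dispatcher = cluster-dispatcher
```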
>>     cluster-dispatcher {
>>       type = "Dispatcher"
>>       executor = "fork-join-executor"
>>       fork-join-executor {
>>         parallelism-min = 2
>>         parallelism-max = 4
>>       }
>>     }
>>     min-nr-of-members = 2
>>     auto-down = on
>>
>
I was maybe not clear enough. You should remove auto-down = on and only
define auto-down-unreachable-after; otherwise, for backwards compatibility,
auto-down is immediate as before. In the log you probably see the warning:
"[akka.cluster.auto-down] setting is replaced by
[akka.cluster.auto-down-unreachable-after]"
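In other words, the cluster section should end up with only the new setting, something like:

```
akka.cluster {
  # remove the old "auto-down = on" line entirely
  auto-down-unreachable-after = 100 s
}
```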
>> auto-down-unreachable-after = 100 s
>>
>>     log-dead-letters = 10
>>     log-dead-letters-during-shutdown = on
>>     loglevel = "DEBUG"
>>   }
>> }
>>
>>
>>
> I performed several stress tests with intensive use of CPU and memory. CPU
> and heap usage are around 90% of available, but I never got OutOfMemory. However,
> what I do get is plenty of errors indicating that some actors are
> terminated.
>
> AkkaTimeoutException in doWorkLoop:
>> Recipient[Actor[akka://ClusterSystem/remote/akka.tcp/[email protected]:2551/user/clusterController/crawlerManager/c2/$b/taskQueue#-1444799596]]
>> had already been terminated.
>>
>>
This message comes from an ask between two local actors, probably between
the pool routee c2 and taskQueue. taskQueue has been terminated.
If this is related to the cluster auto-down you should see something like
this in the logs:
Marking node(s) as UNREACHABLE ...
Marking unreachable node ... as [Down]
Leader is removing unreachable node ...
> Nodes in the cloud are still connected after 3 hours of running, but this
> obviously indicates that actors are being forced to terminate for some
> reason that I can't understand. Could you please advise what the reason
> for this could be? I have no idea how to investigate this further or how
> to fix it.
>
We would have to look closer at your application design and analyze log
files. Typesafe offers commercial support for such questions:
http://typesafe.com/how/subscription
Regards,
Patrik
>
> Thanks
>
> On Monday, 3 February 2014 22:54:37 UTC-8, Patrik Nordwall wrote:
>>
>>
>>
>>
>> On Mon, Feb 3, 2014 at 10:59 PM, Zoran Jeremic <[email protected]> wrote:
>>
>>> > Have you monitored the heap usage? You might have a memory leak.
>>> Yes. I have a node state listener that monitors CPU and heap, and it
>>> doesn't indicate any problem with the heap until the node becomes unreachable.
>>> Then, 15-20 minutes after the node becomes unreachable, heap and CPU
>>> use increase, but I believe this is because of the many exceptions fired. I
>>> also set up Tomcat to create a heap dump and log garbage collector
>>> activity. Nothing indicates there is an OutOfMemory problem.
>>> Is there anything else that could cause this unreachable-node problem? I
>>> thought it could only be a network problem if the node is still running. When
>>> this happens, the application is still running on the slave node and I can
>>> access it through the REST services it exposes.
>>>
>>> > Are you using auto-down?
>>> > In 2.3.0-RC2 you can configure auto-down-unreachable-after to a longer
>>> > period, or not configure it at all (default off) and use some
>>> > external/manual downing strategy.
>>> Yes, I'm using it, but I'm not sure what I could get with it. I would
>>> like to prevent jobs from being interrupted if possible, or at least make it
>>> happen less often.
>>>
>>
>> auto-down=on means that the cluster member will be downed and removed
>> when it is detected as unreachable.
>> Setting a longer auto-down-unreachable-after duration gives it a chance
>> to come back without any downing if it's a transient network glitch or long
>> GC pause.
>>
>> /Patrik
>>
>>
>>
>>> Here is my configuration:
>>>
>>>> akka {
>>>>   actor {
>>>>     provider = "akka.cluster.ClusterActorRefProvider"
>>>>   }
>>>>   remote {
>>>>     log-remote-lifecycle-events = off
>>>>     netty.tcp {
>>>>       hostname = "xxx.xxx.xxx.66"
>>>>       port = 2552
>>>>     }
>>>>   }
>>>>   cluster {
>>>>     seed-nodes = [
>>>>       "akka.tcp://[email protected]:2551"
>>>>     ]
>>>>
>>>>     min-nr-of-members = 2
>>>>     auto-down = on
>>>>     log-dead-letters = 10
>>>>     log-dead-letters-during-shutdown = on
>>>>     allowLocalRoutees = true
>>>>     userRole = null
>>>>   }
>>>> }
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Monday, 3 February 2014 11:46:03 UTC-8, Zoran Jeremic wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm implementing a Web site crawler using Akka clustering. I have one
>>>> node that plays the role of the master node, where the cluster is
>>>> initialized from the ClusterControllerActor in the following way:
>>>>
>>>>> int totalInstances = 2;
>>>>> int maxInstancesPerNode = 1;
>>>>> boolean allowLocalRoutees = true;
>>>>> String useRole = null;
>>>>> AdaptiveLoadBalancingPool pool = new AdaptiveLoadBalancingPool(
>>>>>         MixMetricsSelector.getInstance(), 0);
>>>>> ClusterRouterPoolSettings settings = new ClusterRouterPoolSettings(
>>>>>         totalInstances, maxInstancesPerNode, allowLocalRoutees, useRole);
>>>>> crawlerManager = getContext().actorOf(
>>>>>         new ClusterRouterPool(pool, settings).props(
>>>>>                 Props.create(CrawlerManagerActor.class, getSelf())),
>>>>>         "crawlerManager");
>>>>
>>>>
>>>>
>>>> Other nodes play the role of slaves, and each node in the cluster has one
>>>> CrawlerManagerActor created from the master node in the AdaptiveLoadBalancingPool.
>>>> For each crawling job (Web site to be crawled) I create one job that is
>>>> delegated to one of the CrawlerManagers based on MixMetricsSelector,
>>>> and each job runs on only one node. The CrawlerManager creates a set of
>>>> actors that work together on the dedicated task. After the job is finished,
>>>> these actors are stopped and killed. This works pretty well until the
>>>> moment the slave node is disconnected from the cluster. I don't know why the
>>>> slave node is disconnected after 24-48 hours of successful work (I'm
>>>> hosting it on 2 Microsoft Azure instances with Tomcat), but once this
>>>> happens I get plenty of errors, e.g.:
>>>>
>>>>> akka.pattern.AskTimeoutException: Recipient[Actor[akka://ClusterSystem/remote/akka.tcp/[email protected]:2551/user/clusterController/crawlerManager/c2/$c/statisticsService#-1293629832]]
>>>>> had already been terminated.
>>>>
>>>>
>>>> The statisticsService referred to here is actually running on the slave
>>>> node and is reachable from the slave node at the address xxx.xxx.xxx.66:2552,
>>>> but it is referred to through the clusterController, which is initialized on
>>>> the master node, so the StatisticsServiceActor running on node
>>>> xxx.xxx.xxx.66:2552 has the path:
>>>>
>>>>> akka://ClusterSystem/remote/akka.tcp/[email protected]:2551/user/clusterController/crawlerManager/c2/$a/statisticsService
>>>>
>>>>
>>>> I would like these actors (the CrawlerManager and all the other actors
>>>> created for each individual job and running on one node instance) to be
>>>> able to continue running and finish the job even if the node is disconnected
>>>> from the cluster, so this is not what I'm expecting or what I want. I'm not
>>>> sure if this is expected behaviour or I missed doing something, but I hope
>>>> somebody will have an idea of how I could resolve this.
>>>>
>>>> Thanks,
>>>> Zoran
>>>>
>>>
>>
>>
>>
>> --
>>
>> Patrik Nordwall
>> Typesafe <http://typesafe.com/> - Reactive apps on the JVM
>> Twitter: @patriknw
>>
>
--
Patrik Nordwall
Typesafe <http://typesafe.com/> - Reactive apps on the JVM
Twitter: @patriknw
--
>>>>>>>>>> Read the docs: http://akka.io/docs/
>>>>>>>>>> Check the FAQ: http://akka.io/faq/
>>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
---
You received this message because you are subscribed to the Google Groups "Akka
User List" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/akka-user.
For more options, visit https://groups.google.com/groups/opt_out.