Using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true didn't solve the problem.

I doubled the number of Netty threads and doubled the size of the Netty buffers, and nothing changed. I also condensed the messages, roughly 1000 into 1, which gives a lot fewer messages, but still, the same final result. A sketch of the kind of combiner I mean is right below. Please, help.
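For reference, the condensing was done with a message combiner along these lines. This is a minimal, hypothetical sketch (the class name and the separator are illustrative, not my exact code), assuming the Giraph 1.1 MessageCombiner interface and Text path messages:

import org.apache.giraph.combiner.MessageCombiner;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Hypothetical sketch: fold many path messages into one Text message,
// separated by ';', so each vertex receives a single combined message.
public class PathMessageCombiner
    implements MessageCombiner<LongWritable, Text> {

  @Override
  public void combine(LongWritable vertexIndex, Text originalMessage,
      Text messageToCombine) {
    // Accumulate the incoming message into originalMessage in place.
    if (originalMessage.getLength() == 0) {
      originalMessage.set(messageToCombine);
    } else {
      originalMessage.set(originalMessage + ";" + messageToCombine);
    }
  }

  @Override
  public Text createInitialMessage() {
    return new Text();
  }
}

If I remember the property name right, it gets registered with -ca giraph.messageCombinerClass=PathMessageCombiner.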
2016-08-26 21:24 GMT-03:00 José Luis Larroque <larroques...@gmail.com>:

> Hi again guys!
>
> I'm doing a BFS search through the Wikipedia (Spanish edition) site. I
> converted the dump (https://dumps.wikimedia.org/eswiki/20160601/) into a
> file that Giraph can read.
>
> The BFS is searching for paths, and everything is fine until it gets
> stuck at some point of superstep four. (A simplified sketch of the
> computation appears after the master logs below.)
>
> I'm using a cluster of 5 nodes (1 master, 4 core slaves) on AWS. Each
> node is an r3.8xlarge EC2 instance. The command for executing the BFS is
> this one:
>
> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar \
>     ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote \
>     -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat \
>     -vip /user/hduser/input/grafo-wikipedia.txt \
>     -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat \
>     -op /user/hduser/output/caminosNavegacionales \
>     -w 4 -yh 120000 \
>     -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug
>
> Each container has almost 120 GB. I'm using a limit of 1000M messages in
> memory for out-of-core because I believed that was the problem, but
> apparently it is not.
>
> These are the master logs (it seems to be waiting for the workers to
> finish, but they just don't... and it keeps like this forever):
>
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
> 16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
>
> ...the last two lines repeat about thirty times, and then the whole
> block above repeats, over and over...
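> For context, the computation itself is essentially standard BFS path
> expansion. A minimal, hypothetical sketch (not my exact code; the source
> test, the types and the message format are illustrative, and
> visited/cycle checks are omitted):
>
> import java.io.IOException;
> import org.apache.giraph.graph.BasicComputation;
> import org.apache.giraph.graph.Vertex;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
>
> // Simplified stand-in for the real computation: every message carries
> // the path walked so far; each vertex appends itself and forwards.
> public class PathSearchComputation extends
>     BasicComputation<LongWritable, Text, NullWritable, Text> {
>
>   @Override
>   public void compute(Vertex<LongWritable, Text, NullWritable> vertex,
>       Iterable<Text> messages) throws IOException {
>     if (getSuperstep() == 0) {
>       // Seed: only the source vertex starts a path.
>       if (vertex.getId().get() == 0L) {
>         sendMessageToAllEdges(vertex, new Text("0"));
>       }
>     } else {
>       for (Text path : messages) {
>         // Extend every incoming path with this vertex and forward it,
>         // so the message volume grows quickly with each superstep.
>         sendMessageToAllEdges(vertex, new Text(path + "->" + vertex.getId()));
>       }
>     }
>     vertex.voteToHalt();
>   }
> }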
> And in *all* the workers there is no information about what is happening
> (I'm testing this with *giraph.logLevel=Debug*, because with Giraph's
> default log level I was lost), and the workers say this over and over
> again:
>
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82
>
> Before starting superstep 4, the information on each worker was the
> following:
>
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> ...the same four workers repeat in the same round-robin order for partitions 4 through 15...
> 16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory (free/total/max) = 92421.41M / 115000.00M / 115000.00M
>
> I don't know what exactly is failing:
> - I know that all containers have memory available; on the datanodes I
> checked that each one had around 50 GB free.
> - I'm not sure whether I'm hitting some limit in the use of out-of-core.
> I know that writing messages too fast is dangerous with the 1.1 version
> of Giraph, but if I hit that limit, I suppose the container would fail,
> right?
> - Maybe the connections for the ZooKeeper client aren't enough? I read
> that the default value of 60 for *maxClientCnxns* in ZooKeeper may be
> too small in a context like AWS, but I'm not fully aware of the
> relationship between Giraph and ZooKeeper, so I haven't started changing
> default configuration values.
> - Maybe I have to tune the out-of-core configuration, using
> giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true,
> as someone recommended here (http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCC775449.2C4B%25majakabi...@fb.com%3E)?
> - Should I tune the Netty configuration? I have the defaults, but I
> believe that only 8 Netty client threads and 8 server threads would be
> enough, since I have only a few workers; maybe too many Netty threads
> are creating the overhead that makes the entire application get stuck.
> (A sketch of the settings I have in mind follows this list.)
> - Using giraph.useBigDataIOForMessages=true didn't help me either. I
> know that each vertex is receiving 100M or more messages, and that
> property should be helpful, but it didn't make any difference anyway.
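> For the last two hypotheses, this is roughly what I would try. A sketch
> only: the property names are the Giraph 1.1 ones as far as I know, and
> the thread counts and request cap are guesses to experiment with, not
> validated values:
>
> import org.apache.giraph.conf.GiraphConfiguration;
>
> // Hypothetical tuning sketch: shrink the Netty thread pools for a
> // 4-worker job and throttle senders so they can't outrun the
> // out-of-core message store.
> public class TuningSketch {
>   public static void applyTuning(GiraphConfiguration conf) {
>     // Fewer Netty threads; the defaults look oversized for 4 workers.
>     conf.setInt("giraph.nettyClientThreads", 8);
>     conf.setInt("giraph.nettyServerThreads", 8);
>     // Cap in-flight requests and wait for confirmations, so fast
>     // senders block instead of flooding slow receivers.
>     conf.setInt("giraph.maxNumberOfOpenRequests", 10000);
>     conf.setBoolean("giraph.waitForRequestsConfirmation", true);
>   }
> }
>
> The same values can also go on the command line through -ca, like the
> other options above.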
> As you may be suspecting, I have too many hypotheses; that's why I'm
> seeking help, so I can go in the right direction.
>
> Any help would be greatly appreciated.
>
> Bye!
> Jose