Ok, I understand what is happening now.

I started using more compute threads, because I believed that the problem
was scalability. I started the application again, using:
giraph.numComputeThreads=15 (r3.8xlarge has 32 cores)
giraph.userPartitionCount=240 (4 for each compute thread)
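
For clarity, those two options just get appended to the -ca list already
used in the original command further below (same jar, classes and paths;
this is only a sketch of the relevant fragment of the invocation):

-ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.numComputeThreads=15,giraph.userPartitionCount=240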

The application gets stuck on only one thread, and only on one partition.
In this partition, I'm doing a small piece of processing on each message: I
have to append the vertex id to the end of each message, in order to have
the result ready for the output of that vertex.
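
To make that step concrete, here is a minimal sketch (not my actual code;
the class name, value types and output layout are illustrative) of the kind
of compute() that does this per-message appending:

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class AppendVertexIdComputation
    extends BasicComputation<LongWritable, Text, NullWritable, Text> {

  @Override
  public void compute(Vertex<LongWritable, Text, NullWritable> vertex,
      Iterable<Text> messages) throws IOException {
    StringBuilder result = new StringBuilder(vertex.getValue().toString());
    for (Text message : messages) {
      // Append this vertex's id to the end of each incoming message so the
      // vertex's output value carries the completed path.
      result.append(message.toString())
            .append(' ')
            .append(vertex.getId().get())
            .append('\n');
    }
    vertex.setValue(new Text(result.toString()));
    vertex.voteToHalt();
  }
}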

The problem is that this small per-message processing is taking too long,
and I have the entire cluster waiting for it. I know that there are other
technologies for post-processing results; maybe I should use one of them?

Bye!
Jose

2016-08-27 21:33 GMT-03:00 José Luis Larroque <larroques...@gmail.com>:

> Using giraph.maxNumberOfOpenRequests and
> giraph.waitForRequestsConfirmation=true didn't solve the problem.
>
> I doubled the Netty threads, and assigned double the original size to
> the Netty buffers, and saw no change.
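>
> In case it helps, this is roughly how those knobs are passed on the
> command line (the values here are only illustrative, not the exact ones I
> used, and the buffer-size property names should be double-checked against
> GiraphConstants for Giraph 1.1):
>
> -ca giraph.maxNumberOfOpenRequests=1000,giraph.waitForRequestsConfirmation=true,giraph.nettyClientThreads=8,giraph.nettyServerThreads=32,giraph.clientSendBufferSize=1048576,giraph.serverReceiveBufferSize=1048576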
>
> I condensed the messages, 1000 into 1, and got far fewer messages, but
> still, the same final result.
>
> Please, help.
>
> 2016-08-26 21:24 GMT-03:00 José Luis Larroque <larroques...@gmail.com>:
>
>> Hi again guys!
>>
>> I'm doing a BFS search over the Wikipedia (Spanish edition) site. I
>> converted the dump (https://dumps.wikimedia.org/eswiki/20160601) into a
>> file that can be read by Giraph.
>>
>> The BFS is searching for paths, and it's all ok until it gets stuck at
>> some point during superstep four.
>>
>> I'm using a cluster of 5 nodes (4 core slaves, 1 master) on AWS. Each
>> node is an r3.8xlarge EC2 instance. The command for executing the BFS is
>> this one:
>> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>> -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>> -vip /user/hduser/input/grafo-wikipedia.txt
>> -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>> -op /user/hduser/output/caminosNavegacionales -w 4 -yh 120000 -ca
>> giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,*giraph.logLevel=Debug*
>>
>> Each container has almost 120 GB. I'm using a 1000M message limit in
>> out-of-core, because I believed that was the problem, but apparently it
>> is not.
>>
>> These are the master logs (it seems it is waiting for the workers to
>> finish, but they just never do... and it stays like this forever...):
>>
>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
>> finished worker list = [], size = 0, worker list =
>> [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000),
>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
>> size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_
>> applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>
>> *16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
>> 16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled
>> of false*
>> ...the same last two lines repeat thirty times...
>> ...
>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
>> finished worker list = [], size = 0, worker list =
>> [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000),
>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
>> size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_
>> applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>
>> And in *all* the workers there is no information on what is happening
>> (I'm testing this with *giraph.logLevel=Debug* because with Giraph's
>> default log level I was lost), and the workers say this over and over
>> again:
>>
>> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result
>> not ready yet java.util.concurrent.FutureTask@7392f34d
>> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for
>> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82
>>
>> Before starting superstep 4, the information on each worker was the
>> following:
>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2]
>> startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep:
>> addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal,
>> MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal,
>> MRtaskID
>> =1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal,
>> MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal,
>> MRtaskID=4, port=30004)]
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1
>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2
>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 4
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 5
>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 6
>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 7
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 8
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 9
>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 10
>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 11
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 12
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 13
>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 14
>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
>> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 15
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
>> 16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory
>> (free/total/max) = 92421.41M / 115000.00M / 115000.00M
>>
>>
>> I don't know what exactly is failing:
>> - I know that all containers have memory available; on the datanodes I
>> checked that each one had about 50 GB free.
>> - I'm not sure if I'm hitting some sort of limit in the use of
>> out-of-core. I know that writing messages too fast is dangerous with
>> Giraph version 1.1, but if I hit that limit, I suppose the container
>> would fail, right?
>> - Maybe the connections for the ZooKeeper client aren't enough? I read
>> that the default value of 60 for *maxClientCnxns* in ZooKeeper may be too
>> small for a context like AWS (see the sketch after this list), but I'm
>> not fully aware of the relationship between Giraph and ZooKeeper, so I
>> haven't started changing default configuration values.
>> - Maybe I have to tune the out-of-core configuration? Using
>> giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true
>> like someone recommended here (http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCC775449.2C4B%25majakabi...@fb.com%3E)?
>> - Should I tune the Netty configuration? I have the default
>> configuration, but I believe that maybe using only 8 Netty client and 8
>> server threads would be enough, since I have only a few workers, and
>> maybe too many Netty threads are creating the overhead that is making
>> the entire application get stuck.
>> - Using giraph.useBigDataIOForMessages=true didn't help me either. I
>> know that each vertex is receiving 100M or more messages and that
>> property should be helpful, but it didn't make any difference anyway.
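>>
>> For the ZooKeeper hypothesis above, this is a minimal sketch of what
>> raising *maxClientCnxns* would look like in a standalone ZooKeeper's
>> zoo.cfg (this is plain ZooKeeper configuration, not a Giraph property; if
>> Giraph launches its own ZooKeeper, the equivalent setting would have to
>> go through Giraph's ZooKeeper options instead):
>>
>> # zoo.cfg (sketch; only the connection-limit line matters here)
>> tickTime=2000
>> dataDir=/var/lib/zookeeper
>> clientPort=2181
>> # default is 60 concurrent connections per client IP; raise it so many
>> # workers/threads on the same host don't exhaust the limit
>> maxClientCnxns=200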
>>
>> As you may be suspecting, I have too many hypotheses; that's why I'm
>> seeking help, so I can go in the right direction.
>>
>> Any help would be greatly appreciated.
>>
>> Bye!
>> Jose
>>
>
