Using giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true didn't solve the problem.

I doubled the number of Netty threads and doubled the size of the Netty buffers, and nothing changed. I also condensed the messages, roughly 1000 into 1, which gives a lot fewer messages, but still, the same final result. A sketch of the kind of combiner I mean is right below. Please, help.
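For reference, the condensing was done with a message combiner along these lines. This is a minimal, hypothetical sketch (the class name and the separator are illustrative, not my exact code), assuming the Giraph 1.1 MessageCombiner interface and Text path messages:

import org.apache.giraph.combiner.MessageCombiner;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Hypothetical sketch: fold many path messages into one Text message,
// separated by ';', so each vertex receives a single combined message.
public class PathMessageCombiner
    implements MessageCombiner<LongWritable, Text> {

  @Override
  public void combine(LongWritable vertexIndex, Text originalMessage,
      Text messageToCombine) {
    // Accumulate the incoming message into originalMessage in place.
    if (originalMessage.getLength() == 0) {
      originalMessage.set(messageToCombine);
    } else {
      originalMessage.set(originalMessage + ";" + messageToCombine);
    }
  }

  @Override
  public Text createInitialMessage() {
    return new Text();
  }
}

If I remember the property name right, it gets registered with -ca giraph.messageCombinerClass=PathMessageCombiner.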
2016-08-26 21:24 GMT-03:00 José Luis Larroque <larroques...@gmail.com>:

> Hi again guys!
>
> I'm doing a BFS search through the Wikipedia (Spanish edition) site. I
> converted the dump (https://dumps.wikimedia.org/eswiki/20160601/) into a
> file that Giraph can read.
>
> The BFS is searching for paths, and everything is fine until it gets
> stuck at some point of superstep four. (A simplified sketch of the
> computation appears after the master logs below.)
>
> I'm using a cluster of 5 nodes (1 master, 4 core slaves) on AWS. Each
> node is an r3.8xlarge EC2 instance. The command for executing the BFS is
> this one:
>
> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar \
>     ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote \
>     -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat \
>     -vip /user/hduser/input/grafo-wikipedia.txt \
>     -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat \
>     -op /user/hduser/output/caminosNavegacionales \
>     -w 4 -yh 120000 \
>     -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=1000000000,giraph.isStaticGraph=true,giraph.logLevel=Debug
>
> Each container has almost 120 GB. I'm using a limit of 1000M messages in
> memory for out-of-core because I believed that was the problem, but
> apparently it is not.
>
> These are the master logs (it seems to be waiting for the workers to
> finish, but they just don't... and it keeps like this forever):
>
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got finished worker list = [], size = 0, worker list = [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)], size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3] MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 10000
> 16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed signaled of false
>
> ...the last two lines repeat about thirty times, and then the whole
> block above repeats, over and over...
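> For context, the computation itself is essentially standard BFS path
> expansion. A minimal, hypothetical sketch (not my exact code; the source
> test, the types and the message format are illustrative, and
> visited/cycle checks are omitted):
>
> import java.io.IOException;
> import org.apache.giraph.graph.BasicComputation;
> import org.apache.giraph.graph.Vertex;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
>
> // Simplified stand-in for the real computation: every message carries
> // the path walked so far; each vertex appends itself and forwards.
> public class PathSearchComputation extends
>     BasicComputation<LongWritable, Text, NullWritable, Text> {
>
>   @Override
>   public void compute(Vertex<LongWritable, Text, NullWritable> vertex,
>       Iterable<Text> messages) throws IOException {
>     if (getSuperstep() == 0) {
>       // Seed: only the source vertex starts a path.
>       if (vertex.getId().get() == 0L) {
>         sendMessageToAllEdges(vertex, new Text("0"));
>       }
>     } else {
>       for (Text path : messages) {
>         // Extend every incoming path with this vertex and forward it,
>         // so the message volume grows quickly with each superstep.
>         sendMessageToAllEdges(vertex, new Text(path + "->" + vertex.getId()));
>       }
>     }
>     vertex.voteToHalt();
>   }
> }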
> And in *all* the workers there is no information about what is happening
> (I'm testing this with *giraph.logLevel=Debug*, because with Giraph's
> default log level I was lost), and the workers say this over and over
> again:
>
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not ready yet java.util.concurrent.FutureTask@7392f34d
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82
>
> Before starting superstep 4, the information on each worker was the
> following:
>
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2] startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep: addressesAndPartitions [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000), Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)]
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=30000)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1 Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 2 Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 3 Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)
> ...the same four workers repeat in the same round-robin order for partitions 4 through 15...
> 16/08/26 00:43:08 DEBUG graph.GraphTaskManager: execute: Memory (free/total/max) = 92421.41M / 115000.00M / 115000.00M
>
> I don't know what exactly is failing:
> - I know that all containers have memory available; on the datanodes I
> checked that each one had around 50 GB free.
> - I'm not sure whether I'm hitting some limit in the use of out-of-core.
> I know that writing messages too fast is dangerous with the 1.1 version
> of Giraph, but if I hit that limit, I suppose the container would fail,
> right?
> - Maybe the connections for the ZooKeeper client aren't enough? I read
> that the default value of 60 for *maxClientCnxns* in ZooKeeper may be
> too small in a context like AWS, but I'm not fully aware of the
> relationship between Giraph and ZooKeeper, so I haven't started changing
> default configuration values.
> - Maybe I have to tune the out-of-core configuration, using
> giraph.maxNumberOfOpenRequests and giraph.waitForRequestsConfirmation=true,
> as someone recommended here (http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3CCC775449.2C4B%25majakabi...@fb.com%3E)?
> - Should I tune the Netty configuration? I have the defaults, but I
> believe that only 8 Netty client threads and 8 server threads would be
> enough, since I have only a few workers; maybe too many Netty threads
> are creating the overhead that makes the entire application get stuck.
> (A sketch of the settings I have in mind follows this list.)
> - Using giraph.useBigDataIOForMessages=true didn't help me either. I
> know that each vertex is receiving 100M or more messages, and that
> property should be helpful, but it didn't make any difference anyway.
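> For the last two hypotheses, this is roughly what I would try. A sketch
> only: the property names are the Giraph 1.1 ones as far as I know, and
> the thread counts and request cap are guesses to experiment with, not
> validated values:
>
> import org.apache.giraph.conf.GiraphConfiguration;
>
> // Hypothetical tuning sketch: shrink the Netty thread pools for a
> // 4-worker job and throttle senders so they can't outrun the
> // out-of-core message store.
> public class TuningSketch {
>   public static void applyTuning(GiraphConfiguration conf) {
>     // Fewer Netty threads; the defaults look oversized for 4 workers.
>     conf.setInt("giraph.nettyClientThreads", 8);
>     conf.setInt("giraph.nettyServerThreads", 8);
>     // Cap in-flight requests and wait for confirmations, so fast
>     // senders block instead of flooding slow receivers.
>     conf.setInt("giraph.maxNumberOfOpenRequests", 10000);
>     conf.setBoolean("giraph.waitForRequestsConfirmation", true);
>   }
> }
>
> The same values can also go on the command line through -ca, like the
> other options above.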
> As you may be suspecting, I have too many hypotheses; that's why I'm
> seeking help, so I can go in the right direction.
>
> Any help would be greatly appreciated.
>
> Bye!
> Jose