Re: Giraph application gets stuck on superstep 4, all workers active but without progress

2016-08-28 Thread José Luis Larroque
Problem solved: I optimized the processing of each message, and that fixed it.

Sorry for the spam guys :D

Bye!
Jose

2016-08-28 15:23 GMT-03:00 José Luis Larroque :

> Ok, I understand what is happening now.
>
> I started using more compute threads, because I believed that the
> problem was scalability. I started the application again, using:
> giraph.numComputeThreads=15 (r3.8xlarge has 32 cores)
> giraph.userPartitionCount=240 (4 for each compute thread)
>
> The application gets stuck on only one thread, and only in one partition.
> In this partition I'm doing a small amount of processing on each message: I
> have to add the vertex id to the end of each message, in order to have the
> result for the output of that vertex.
>
> The problem is that this small per-message processing is taking too
> long, and I have the entire cluster waiting for it. I know that there
> are other technologies for post-processing results; maybe I should use one
> of them?
>
> Bye!
> Jose
>
> 2016-08-27 21:33 GMT-03:00 José Luis Larroque :
>
>> Using giraph.maxNumberOfOpenRequests and 
>> giraph.waitForRequestsConfirmation=true
>> didn't solve the problem.
>>
>> I doubled the Netty threads and gave the Netty buffers double their
>> original size, and there was no change.
>>
>> I condensed the messages, 1000 into 1, and got far fewer messages, but
>> still the same final result.
>>
>> Please help.
>>
>> 2016-08-26 21:24 GMT-03:00 José Luis Larroque :
>>
>>> Hi again guys!
>>>
>>> I'm doing a BFS search through the Wikipedia (Spanish edition) site. I
>>> converted the dump (https://dumps.wikimedia.org/eswiki/20160601) into a
>>> file that can be read by Giraph.
>>>
>>> The BFS is searching for paths, and it's all OK until it gets stuck at
>>> some point in superstep four.
>>>
>>> I'm using a cluster of 5 nodes (4 core slaves, 1 master) on AWS. Each
>>> node is an r3.8xlarge EC2 instance. The command for executing the BFS is
>>> this one:
>>> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>>> -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>>> -vip /user/hduser/input/grafo-wikipedia.txt
>>> -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>>> -op /user/hduser/output/caminosNavegacionales -w 4 -yh 12
>>> -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=10,giraph.isStaticGraph=true,*giraph.logLevel=Debug*
>>>
>>> Each container has (almost) 120GB. I'm using a 1000M message limit in
>>> out-of-core, because I believed that was the problem, but apparently it is
>>> not.
>>>
>>> These are the master logs (it seems that it is waiting for the workers to
>>> finish, but they just don't... and it keeps like this forever...):
>>>
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList:
>>> Got finished worker list = [], size = 0, worker list =
>>> [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=3),
>>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
>>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
>>> size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>>
>>> *16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 1
>>> 16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed
>>> signaled of false*
>>> ...the same last two lines repeated thirty times...
>>> ...
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList:
>>> Got finished worker list = [], size = 0, worker list =
>>> [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=3),
>>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
>>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
>>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
>>> size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>>
>>> And in *all* workers, there is no information on what is happening (I'm
>>> testing this with

Re: Giraph application gets stuck on superstep 4, all workers active but without progress

2016-08-28 Thread José Luis Larroque
Ok, I understand what is happening now.

I started using more compute threads, because I believed that the problem
was scalability. I started the application again, using:
giraph.numComputeThreads=15 (r3.8xlarge has 32 cores)
giraph.userPartitionCount=240 (4 for each compute thread)
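
Both of these are ordinary Giraph options, so they can be passed like the
other settings in this thread, by appending them to the -ca list of the yarn
command quoted further down (the "..." below stands for the rest of that
command):

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ... -w 4 \
  -ca giraph.numComputeThreads=15,giraph.userPartitionCount=240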

The application gets stuck on only one thread, and only in one partition.
In this partition I'm doing a small amount of processing on each message: I
have to add the vertex id to the end of each message, in order to have the
result for the output of that vertex.
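
For reference, the per-message work described here boils down to something
like the following minimal sketch (illustrative only, not the actual job
class; it assumes Text vertex ids, Text vertex values and Text path messages):

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Illustrative sketch: append this vertex's id to every incoming path
// message and accumulate the result in the vertex value for the output.
public class AppendVertexIdComputation
    extends BasicComputation<Text, Text, NullWritable, Text> {

  @Override
  public void compute(Vertex<Text, Text, NullWritable> vertex,
      Iterable<Text> messages) throws IOException {
    StringBuilder result = new StringBuilder(vertex.getValue().toString());
    String id = vertex.getId().toString();  // resolved once, not per message
    for (Text path : messages) {
      result.append(path.toString()).append(id).append(';');
    }
    vertex.setValue(new Text(result.toString()));
    vertex.voteToHalt();
  }
}

Accumulating into a single StringBuilder per vertex, instead of concatenating
Strings or Text values message by message, is the usual way to keep this step
linear when one partition receives millions of messages.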

The problem is that this small per-message processing is taking too long,
and I have the entire cluster waiting for it. I know that there are
other technologies for post-processing results; maybe I should use one of
them?

Bye!
Jose

2016-08-27 21:33 GMT-03:00 José Luis Larroque :

> Using giraph.maxNumberOfOpenRequests and
> giraph.waitForRequestsConfirmation=true didn't solve the problem.
>
> I doubled the Netty threads and gave the Netty buffers double their
> original size, and there was no change.
>
> I condensed the messages, 1000 into 1, and got far fewer messages, but
> still the same final result.
>
> Please help.
>
> 2016-08-26 21:24 GMT-03:00 José Luis Larroque :
>
>> Hi again guys!
>>
>> I'm doing a BFS search through the Wikipedia (Spanish edition) site. I
>> converted the dump (https://dumps.wikimedia.org/eswiki/20160601) into a
>> file that can be read by Giraph.
>>
>> The BFS is searching for paths, and it's all OK until it gets stuck at
>> some point in superstep four.
>>
>> I'm using a cluster of 5 nodes (4 core slaves, 1 master) on AWS. Each
>> node is an r3.8xlarge EC2 instance. The command for executing the BFS is
>> this one:
>> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
>> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
>> -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
>> -vip /user/hduser/input/grafo-wikipedia.txt
>> -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
>> -op /user/hduser/output/caminosNavegacionales -w 4 -yh 12
>> -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=10,giraph.isStaticGraph=true,*giraph.logLevel=Debug*
>>
>> Each container has (almost) 120GB. I'm using a 1000M message limit in
>> out-of-core, because I believed that was the problem, but apparently it is
>> not.
>>
>> These are the master logs (it seems that it is waiting for the workers to
>> finish, but they just don't... and it keeps like this forever...):
>>
>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
>> finished worker list = [], size = 0, worker list =
>> [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=3),
>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
>> size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>
>> *16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 1
>> 16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed
>> signaled of false*
>> ...the same last two lines repeated thirty times...
>> ...
>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
>> finished worker list = [], size = 0, worker list =
>> [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=3),
>> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
>> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
>> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
>> size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
>> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
>> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>>
>> And in *all* workers, there is no information on what is happening (I'm
>> testing this with *giraph.logLevel=Debug* because with the default Giraph
>> log level I was lost), and the workers say this over and over again:
>>
>> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result
>> not ready yet java.util.concurrent.FutureTask@7392f34d
>> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting 

Re: Giraph application gets stuck on superstep 4, all workers active but without progress

2016-08-27 Thread José Luis Larroque
Using giraph.maxNumberOfOpenRequests and
giraph.waitForRequestsConfirmation=true didn't solve the problem.
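
(Both are plain Giraph options, so they are normally applied the same way as
the other settings in this thread, by extending the -ca list, e.g.
-ca giraph.maxNumberOfOpenRequests=2000,giraph.waitForRequestsConfirmation=true
where the open-request limit above is only an illustrative value.)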

I doubled the Netty threads and gave the Netty buffers double their original
size, and there was no change.

I condensed the messages, 1000 into 1, and got far fewer messages, but
still the same final result.
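
The thread doesn't show how the condensing was done; one standard Giraph way
to merge many messages addressed to the same vertex is a MessageCombiner,
roughly like this sketch for Text path messages (illustrative only; API as in
Giraph 1.1):

import org.apache.giraph.combiner.MessageCombiner;
import org.apache.hadoop.io.Text;

// Illustrative sketch: concatenate every message bound for a vertex into one
// Text, so each destination receives a single combined message per superstep.
public class TextConcatMessageCombiner
    implements MessageCombiner<Text, Text> {

  @Override
  public void combine(Text vertexIndex, Text originalMessage,
      Text messageToCombine) {
    // Fold the new message into the accumulated one.
    originalMessage.set(originalMessage.toString() + messageToCombine.toString());
  }

  @Override
  public Text createInitialMessage() {
    return new Text("");
  }
}

A combiner like this is wired in through the message-combiner class option
(giraph.messageCombinerClass in Giraph 1.1). Note that concatenation only
reduces the message count, not the total payload, which matches the
observation above that the end result did not change.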

Please help.

2016-08-26 21:24 GMT-03:00 José Luis Larroque :

> Hi again guys!
>
> I'm doing a BFS search through the Wikipedia (Spanish edition) site. I
> converted the dump (https://dumps.wikimedia.org/eswiki/20160601) into a
> file that can be read by Giraph.
>
> The BFS is searching for paths, and it's all OK until it gets stuck at
> some point in superstep four.
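>
> For context, a BFS in Giraph usually follows the pattern below: the source
> starts the wave and every other vertex forwards it the first time it is
> reached; the path-searching version in this thread sends messages carrying
> the path built so far instead of just a level. A generic minimal sketch
> (illustrative only, not the BusquedaDeCaminosNavegacionalesWikiquote class):
>
> import java.io.IOException;
> import org.apache.giraph.graph.BasicComputation;
> import org.apache.giraph.graph.Vertex;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
>
> // Generic BFS-level sketch: each vertex records the superstep at which it
> // is first reached and forwards the wave to its neighbours exactly once.
> public class BfsLevelComputation
>     extends BasicComputation<LongWritable, LongWritable, NullWritable, LongWritable> {
>
>   private static final long SOURCE_ID = 1L;            // assumed source vertex
>   private static final long UNVISITED = Long.MAX_VALUE;
>
>   @Override
>   public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
>       Iterable<LongWritable> messages) throws IOException {
>     if (getSuperstep() == 0) {
>       boolean isSource = vertex.getId().get() == SOURCE_ID;
>       vertex.setValue(new LongWritable(isSource ? 0L : UNVISITED));
>       if (isSource) {
>         sendMessageToAllEdges(vertex, vertex.getValue());
>       }
>     } else if (vertex.getValue().get() == UNVISITED
>         && messages.iterator().hasNext()) {
>       vertex.setValue(new LongWritable(getSuperstep()));
>       sendMessageToAllEdges(vertex, vertex.getValue());
>     }
>     vertex.voteToHalt();
>   }
> }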
>
> I'm using a cluster of 5 nodes (4 core slaves, 1 master) on AWS. Each node
> is an r3.8xlarge EC2 instance. The command for executing the BFS is this one:
> /home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar
> ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote
> -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat
> -vip /user/hduser/input/grafo-wikipedia.txt
> -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat
> -op /user/hduser/output/caminosNavegacionales -w 4 -yh 12
> -ca giraph.useOutOfCoreMessages=true,giraph.metrics.enable=true,giraph.maxMessagesInMemory=10,giraph.isStaticGraph=true,*giraph.logLevel=Debug*
>
> Each container has (almost) 120GB. I'm using a 1000M message limit in
> out-of-core, because I believed that was the problem, but apparently it is
> not.
>
> These are the master logs (it seems that it is waiting for the workers to
> finish, but they just don't... and it keeps like this forever...):
>
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
> finished worker list = [], size = 0, worker list =
> [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=3),
> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
> size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>
> *16/08/26 00:43:08 DEBUG zk.PredicateLock: waitMsecs: Wait for 1
> 16/08/26 00:43:18 DEBUG zk.PredicateLock: waitMsecs: Got timed
> signaled of false*
> ...the same last two lines repeated thirty times...
> ...
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
> 16/08/26 00:43:08 DEBUG master.BspServiceMaster: barrierOnWorkerList: Got
> finished worker list = [], size = 0, worker list =
> [Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=3),
> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001),
> Worker(hostname=ip-172-31-29-15.ec2.internal, MRtaskID=2, port=30002),
> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=4, port=30004)],
> size = 4 from /_hadoopBsp/giraph_yarn_application_1472168758138_0002/_applicationAttemptsDir/0/_superstepDir/4/_workerFinishedDir
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-3]
> MASTER_ZOOKEEPER_ONLY - 0 finished out of 4 on superstep 4
>
> And in *all* workers, there is no information on what is happening (I'm
> testing this with *giraph.logLevel=Debug* because with the default Giraph
> log level I was lost), and the workers say this over and over again:
>
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Future result not
> ready yet java.util.concurrent.FutureTask@7392f34d
> 16/08/26 01:05:08 INFO utils.ProgressableUtils: waitFor: Waiting for
> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@34a37f82
>
> Before starting superstep 4, the information on each worker was the
> following:
> 16/08/26 00:43:08 INFO yarn.GiraphYarnTask: [STATUS: task-2]
> startSuperstep: WORKER_ONLY - Attempt=0, Superstep=4
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: startSuperstep:
> addressesAndPartitions[Worker(hostname=ip-172-31-29-14.ec2.internal,
> MRtaskID=0, port=3), Worker(hostname=ip-172-31-29-16.ec2.internal,
> MRtaskID=1, port=30001), Worker(hostname=ip-172-31-29-15.ec2.internal,
> MRtaskID=2, port=30002), Worker(hostname=ip-172-31-29-14.ec2.internal,
> MRtaskID=4, port=30004)]
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 0
> Worker(hostname=ip-172-31-29-14.ec2.internal, MRtaskID=0, port=3)
> 16/08/26 00:43:08 DEBUG worker.BspServiceWorker: 1
> Worker(hostname=ip-172-31-29-16.ec2.internal, MRtaskID=1, port=30001)
> 16/08/26 00:43:08 DEBUG