[
https://issues.apache.org/jira/browse/GIRAPH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kristen Hardwick updated GIRAPH-828:
------------------------------------
Description:
Running the exact same launch command twice, with no other changes, produces
different completion results: for example, the first run fails and the second
succeeds. As proof, this is what happened when I ran the
SimpleShortestPathsComputation example:
[PasteBin Link|http://pastebin.com/Qswb98dq]. This happens consistently,
although the job fails much more often than it succeeds.
The PageRank example has the same issue, and the timing problem is even more
obvious there. I followed the directions
[here|http://marsty5.com/2013/05/29/run-example-in-giraph-pagerank/] and ran
the SimplePageRankComputation example with this command:
{code}
hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  -Dgiraph.cleanupCheckpointsAfterSuccess=false \
  -Dgiraph.logLevel=DEBUG \
  -Dgiraph.SplitMasterWorker=false \
  -Dgiraph.zkList="localhost:2181" \
  -Dgiraph.zkSessionMsecTimeout=600000 \
  -Dgiraph.useInputSplitLocality=false \
  org.apache.giraph.examples.SimplePageRankComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/spry/input \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/spry/PageRank \
  -w 2 \
  -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute
{code}
The job technically failed, but I still got output from part file 1 (I expected
values to be printed for all vertices between 0 and 4):
{code}
0 0.16682289373110673
4 0.17098446073203233
2 0.17098446073203233
{code}
I ran the exact same command again (with no changes to the environment except
for deleting the /user/spry/PageRank HDFS directory) and got no part files. I
ran it one more time and got only the data from part file 2:
{code}
1 0.24178880797750438
3 0.24178880797750438
{code}
I tried a few more times, but I haven't been able to see both part files in the
output directory yet.
In the logs, I see hopeful things like this:
{code}
14/01/22 09:47:48 INFO master.MasterThread: setup: Took 3.144 seconds.
14/01/22 09:47:48 INFO master.MasterThread: input superstep: Took 2.582 seconds.
14/01/22 09:47:48 INFO master.MasterThread: superstep 0: Took 0.827 seconds.
...
14/01/22 09:47:48 INFO master.MasterThread: superstep 30: Took 0.56 seconds.
14/01/22 09:47:48 INFO master.MasterThread: shutdown: Took 2.591 seconds.
14/01/22 09:47:48 INFO master.MasterThread: total: Took 30.18 seconds.
14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master is ready to commit final job
output data.
{code}
and like this:
{code}
14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master has committed the final job
output data.
14/01/22 09:47:48 DEBUG ipc.Client: Stopping client
14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to
hadoop2.j7.master/127.0.0.1:8020 from yarn: closed
14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to
hadoop2.j7.master/127.0.0.1:8020 from yarn: stopped, remaining connections 0
{code}
In fact, only one of the containers fails, and it fails with a
DataStreamer/LeaseExpiredException saying that the part file no longer exists.
This log is from the run where part file 2 was not written out:
{code}
14/01/22 09:47:48 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on
/user/spry/PageRank/_temporary/1/_temporary/attempt_1389643303411_0029_m_000002_1/part-m-00002:
File does not exist. Holder DFSClient_NONMAPREDUCE_1153765281_1 does not have
any open files.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
...
14/01/22 09:47:48 ERROR worker.BspServiceWorker: unregisterHealth: Got failure,
unregistering health on
/_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
on superstep 30
14/01/22 09:47:48 DEBUG zookeeper.ClientCnxn: Reading reply
sessionid:0x1438d139efc0039, packet:: clientPath:null serverPath:null
finished:false header:: 589,2 replyHeader:: 589,13968,-101 request::
'/_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2,-1
response:: null
14/01/22 09:47:48 ERROR graph.GraphTaskManager: run: Worker failure failed on
another RuntimeException, original expection will be rethrown
java.lang.IllegalStateException: unregisterHealth: KeeperException - Couldn't
delete
/_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
at
org.apache.giraph.worker.BspServiceWorker.unregisterHealth(BspServiceWorker.java:656)
{code}
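The symptoms above are consistent with the cleanup phase deleting the _temporary attempt directory while a worker still holds its part file open. A minimal local-filesystem sketch of that race (the paths, timings, and two-process setup are illustrative only, not Giraph's actual cleanup code):

```shell
# Simulate the suspected race: a "worker" keeps appending to its part
# file while a "cleanup" step deletes the temporary attempt directory
# out from under it.
OUT=$(mktemp -d)
mkdir -p "$OUT/_temporary/attempt_0"

# Worker: append records to the part file in the background.
(
  for i in 1 2 3 4 5; do
    echo "record $i" >> "$OUT/_temporary/attempt_0/part-m-00000"
    sleep 0.1
  done
) 2>/dev/null &
WORKER=$!

# Cleanup: delete the temporary directory while the worker is still
# writing, as the logs suggest happens to the HDFS attempt directory.
sleep 0.25
rm -rf "$OUT/_temporary"
wait "$WORKER"

# Once cleanup wins the race, the part file is simply gone.
if [ -f "$OUT/_temporary/attempt_0/part-m-00000" ]; then
  RESULT="part file survived"
else
  RESULT="part file missing"
fi
echo "$RESULT"
rm -rf "$OUT"
```

On HDFS, deleting a file that a client is still writing invalidates that client's lease, which would explain the LeaseExpiredException ("No lease on ... File does not exist") in the log above.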
> Race condition during Giraph cleanup phase
> ------------------------------------------
>
> Key: GIRAPH-828
> URL: https://issues.apache.org/jira/browse/GIRAPH-828
> Project: Giraph
> Issue Type: Bug
> Affects Versions: 1.1.0
> Environment: Giraph 1.1,
> Hadoop 2.2.0,
> Java 1.7.0_45
> Reporter: Kristen Hardwick
> Fix For: 1.1.0
>
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)