[
https://issues.apache.org/jira/browse/GIRAPH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885101#comment-13885101
]
Avery Ching commented on GIRAPH-828:
------------------------------------
Can you dump the log of the failed task? It looks like only one of your
workers is succeeding.
> Race condition during Giraph cleanup phase
> ------------------------------------------
>
> Key: GIRAPH-828
> URL: https://issues.apache.org/jira/browse/GIRAPH-828
> Project: Giraph
> Issue Type: Bug
> Affects Versions: 1.1.0
> Environment: Giraph 1.1,
> Hadoop 2.2.0,
> Java 1.7.0_45
> Reporter: Kristen Hardwick
> Fix For: 1.1.0
>
>
> Running the exact same launch command twice, with no other changes, produces
> different completion results: for example, the application will fail on the
> first run and succeed on the second. As evidence, this is what happened when I
> tried to run the SimpleShortestPathsComputation example:
> [PasteBin Link|http://pastebin.com/Qswb98dq]. This happens consistently,
> although the job fails much more often than it succeeds.
> The PageRank example exhibits the same issue; in fact, the timing problem is
> even more obvious there. I followed the directions
> [here|http://marsty5.com/2013/05/29/run-example-in-giraph-pagerank/] and ran
> the SimplePageRankComputation example with this command:
> {code}
> hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar \
>   org.apache.giraph.GiraphRunner \
>   -Dgiraph.cleanupCheckpointsAfterSuccess=false \
>   -Dgiraph.logLevel=DEBUG \
>   -Dgiraph.SplitMasterWorker=false \
>   -Dgiraph.zkList="localhost:2181" \
>   -Dgiraph.zkSessionMsecTimeout=600000 \
>   -Dgiraph.useInputSplitLocality=false \
>   org.apache.giraph.examples.SimplePageRankComputation \
>   -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
>   -vip /user/spry/input \
>   -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>   -op /user/spry/PageRank \
>   -w 2 \
>   -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute
> {code}
> The job technically failed, but I did get output in part file 1 (I expected
> values to be printed for all vertices from 0 through 4).
> {code}
> 0 0.16682289373110673
> 4 0.17098446073203233
> 2 0.17098446073203233
> {code}
> I ran the exact same command again (with no changes to the environment except
> for deleting the /user/spry/PageRank HDFS directory) and got no part files. I
> ran it one more time and got only the data from part file 2:
> {code}
> 1 0.24178880797750438
> 3 0.24178880797750438
> {code}
> I tried a few more times, but I haven't been able to see both part files in
> the output directory yet.
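> For reference, here is a minimal sketch of how one could enumerate whatever
> part files do land in the output directory (my own illustration using the
> stock Hadoop FileSystem API and the /user/spry/PageRank path above, nothing
> Giraph-specific):
> {code}
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class ListPartFiles {
>   public static void main(String[] args) throws Exception {
>     // Output directory from the runs above.
>     Path outputDir = new Path("/user/spry/PageRank");
>     FileSystem fs = FileSystem.get(new Configuration());
>     for (FileStatus status : fs.listStatus(outputDir)) {
>       String name = status.getPath().getName();
>       if (!name.startsWith("part-")) {
>         continue; // skip _SUCCESS and other markers
>       }
>       System.out.println("== " + name + " ==");
>       try (BufferedReader reader = new BufferedReader(
>           new InputStreamReader(fs.open(status.getPath())))) {
>         String line;
>         while ((line = reader.readLine()) != null) {
>           System.out.println(line);
>         }
>       }
>     }
>   }
> }
> {code}
> On the runs above, this prints at most one of the two expected part files.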
> In the logs, I see hopeful things like this:
> {code}
> 14/01/22 09:47:48 INFO master.MasterThread: setup: Took 3.144 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: input superstep: Took 2.582
> seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 0: Took 0.827 seconds.
> ...
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 30: Took 0.56 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: shutdown: Took 2.591 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: total: Took 30.18 seconds.
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master is ready to commit final
> job output data.
> {code}
> and like this:
> {code}
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master has committed the final
> job output data.
> 14/01/22 09:47:48 DEBUG ipc.Client: Stopping client
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to
> hadoop2.j7.master/127.0.0.1:8020 from yarn: closed
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to
> hadoop2.j7.master/127.0.0.1:8020 from yarn: stopped, remaining connections 0
> {code}
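> Those two messages correspond to the Hadoop output-commit sequence, and the
> _temporary/1/_temporary/attempt_.../part-m-xxxxx layout in the exception
> further down is the stock FileOutputCommitter layout. Purely for orientation,
> here is a minimal sketch of that sequence (my own illustration against the
> plain FileOutputCommitter API, not Giraph code; taskContext and jobContext
> stand in for the contexts each container holds):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapreduce.JobContext;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
>
> /** Illustrative commit sequence only; this is not Giraph code. */
> public class CommitSequenceSketch {
>   static void commit(TaskAttemptContext taskContext, JobContext jobContext)
>       throws IOException {
>     FileOutputCommitter committer =
>         new FileOutputCommitter(new Path("/user/spry/PageRank"), taskContext);
>
>     // Each task attempt writes under its own directory, e.g.
>     // <output>/_temporary/1/_temporary/attempt_..._m_000002_1/part-m-00002
>     Path workPath = committer.getWorkPath();
>     System.out.println("task attempt work path: " + workPath);
>
>     // commitTask promotes this attempt's files out of its attempt directory.
>     committer.commitTask(taskContext);
>
>     // commitJob finalizes the output and deletes the whole _temporary tree.
>     committer.commitJob(jobContext);
>   }
> }
> {code}
> If the _temporary tree is removed while another worker still has an open
> stream under it, that worker's lease on the file is gone, which is the shape
> of the failure below.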
> Really, only one of the containers even fails, and it fails with a
> DataStreamer/LeaseExpired exception saying that the part file no longer
> exists. This log is from the run where part file 2 was not written out:
> {code}
> 14/01/22 09:47:48 WARN hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /user/spry/PageRank/_temporary/1/_temporary/attempt_1389643303411_0029_m_000002_1/part-m-00002:
> File does not exist. Holder DFSClient_NONMAPREDUCE_1153765281_1 does not
> have any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
> ...
> 14/01/22 09:47:48 ERROR worker.BspServiceWorker: unregisterHealth: Got
> failure, unregistering health on
> /_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
> on superstep 30
> 14/01/22 09:47:48 DEBUG zookeeper.ClientCnxn: Reading reply
> sessionid:0x1438d139efc0039, packet:: clientPath:null serverPath:null
> finished:false header:: 589,2 replyHeader:: 589,13968,-101 request::
> '/_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2,-1
> response:: null
> 14/01/22 09:47:48 ERROR graph.GraphTaskManager: run: Worker failure failed on
> another RuntimeException, original expection will be rethrown
> java.lang.IllegalStateException: unregisterHealth: KeeperException - Couldn't
> delete
> /_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
> at
> org.apache.giraph.worker.BspServiceWorker.unregisterHealth(BspServiceWorker.java:656)
> {code}
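> For what it's worth, the secondary failure is just the worker failing to
> delete its healthy znode: the -101 reply code in the DEBUG line above appears
> to be ZooKeeper's NoNode error, i.e. the znode (or its superstep parent) is
> already gone by the time unregisterHealth runs, so the attempt to report the
> worker failure itself fails with this IllegalStateException. A standalone
> sketch of that delete (plain ZooKeeper client, not Giraph's BspService
> wrapper; the guarded catch is only there to make the failure mode concrete,
> not a proposed fix):
> {code}
> import org.apache.zookeeper.KeeperException;
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> public class UnregisterHealthSketch {
>   public static void main(String[] args) throws Exception {
>     String healthyNode =
>         "/_hadoopBsp/giraph_yarn_application_1389643303411_0029"
>         + "/_applicationAttemptsDir/0/_superstepDir/30"
>         + "/_workerHealthyDir/localhost_2";
>
>     // Same ZooKeeper quorum and session timeout as in the launch command.
>     ZooKeeper zk = new ZooKeeper("localhost:2181", 600000, new Watcher() {
>       @Override
>       public void process(WatchedEvent event) {
>         // no-op watcher for this sketch
>       }
>     });
>     try {
>       // Version -1 means "any version". This is what deleting the worker's
>       // healthy znode amounts to; in the failing run it throws because the
>       // node is no longer there.
>       zk.delete(healthyNode, -1);
>     } catch (KeeperException.NoNodeException e) {
>       System.out.println("healthy znode already removed: " + healthyNode);
>     } finally {
>       zk.close();
>     }
>   }
> }
> {code}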