[ https://issues.apache.org/jira/browse/GIRAPH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884503#comment-13884503 ]
Kristen Hardwick commented on GIRAPH-828:
-----------------------------------------
Yes, the results are the same even without that command. I also updated the
text above so that "PasteBin Link" is now an actual link (sorry about that).
The command I ran:
{code}
hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar \
  org.apache.giraph.GiraphRunner \
  -Dgiraph.logLevel=DEBUG \
  -Dgiraph.SplitMasterWorker=false \
  -Dgiraph.zkList="localhost:2181" \
  -Dgiraph.zkSessionMsecTimeout=600000 \
  -Dgiraph.useInputSplitLocality=false \
  org.apache.giraph.examples.SimplePageRankComputation \
  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
  -vip /user/spry/input \
  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
  -op /user/spry/PageRank \
  -w 2 \
  -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute
{code}
The first time:
{code}
14/01/28 14:36:57 INFO yarn.GiraphYarnClient: Completed Giraph: org.apache.giraph.examples.SimplePageRankComputation: FAILED, total running time: 0 minutes, 47 seconds.

[spry@hadoop2 giraph]$ hadoop fs -ls /user/spry/PageRank
Found 2 items
-rw-r--r--   3 yarn hdfs          0 2014-01-28 14:36 /user/spry/PageRank/_SUCCESS
-rw-r--r--   3 yarn hdfs         66 2014-01-28 14:36 /user/spry/PageRank/part-m-00001
{code}
The second time (after removing the /user/spry/PageRank directory):
{code}
14/01/28 14:41:34 INFO yarn.GiraphYarnClient: Completed Giraph: org.apache.giraph.examples.SimplePageRankComputation: FAILED, total running time: 0 minutes, 42 seconds.

[spry@hadoop2 giraph]$ hadoop fs -ls /user/spry/PageRank
Found 2 items
-rw-r--r--   3 yarn hdfs          0 2014-01-28 14:41 /user/spry/PageRank/_SUCCESS
-rw-r--r--   3 yarn hdfs         44 2014-01-28 14:41 /user/spry/PageRank/part-m-00002
{code}
The third time (switched to a new output directory in case that matters; same
command except for "-op /user/spry/PageRank2"):
{code}
14/01/28 14:43:51 INFO yarn.GiraphYarnClient: Completed Giraph: org.apache.giraph.examples.SimplePageRankComputation: FAILED, total running time: 0 minutes, 42 seconds.

[spry@hadoop2 giraph]$ hadoop fs -ls /user/spry/PageRank2
Found 2 items
-rw-r--r--   3 yarn hdfs          0 2014-01-28 14:43 /user/spry/PageRank2/_SUCCESS
-rw-r--r--   3 yarn hdfs         44 2014-01-28 14:43 /user/spry/PageRank2/part-m-00002
{code}
> Race condition during Giraph cleanup phase
> ------------------------------------------
>
> Key: GIRAPH-828
> URL: https://issues.apache.org/jira/browse/GIRAPH-828
> Project: Giraph
> Issue Type: Bug
> Affects Versions: 1.1.0
> Environment: Giraph 1.1,
> Hadoop 2.2.0,
> Java 1.7.0_45
> Reporter: Kristen Hardwick
> Fix For: 1.1.0
>
>
> Running the exact same launch command twice, with no other changes, can
> produce different completion results: for example, the first run will fail
> and the second will succeed. As evidence, this is what happened when I ran
> the SimpleShortestPathsComputation example:
> [PasteBin Link|http://pastebin.com/Qswb98dq]. This happens consistently,
> although the job fails much more often than it succeeds.
> The PageRank example has the same issue; in fact, the timing problem is even
> more obvious there. I followed the directions
> [here|http://marsty5.com/2013/05/29/run-example-in-giraph-pagerank/] and ran
> the SimplePageRankComputation example with this command:
> {code}
> hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar \
>   org.apache.giraph.GiraphRunner \
>   -Dgiraph.cleanupCheckpointsAfterSuccess=false \
>   -Dgiraph.logLevel=DEBUG \
>   -Dgiraph.SplitMasterWorker=false \
>   -Dgiraph.zkList="localhost:2181" \
>   -Dgiraph.zkSessionMsecTimeout=600000 \
>   -Dgiraph.useInputSplitLocality=false \
>   org.apache.giraph.examples.SimplePageRankComputation \
>   -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
>   -vip /user/spry/input \
>   -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
>   -op /user/spry/PageRank \
>   -w 2 \
>   -mc org.apache.giraph.examples.SimplePageRankComputation\$SimplePageRankMasterCompute
> {code}
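> (For anyone trying to reproduce this: any small graph in
> JsonLongDoubleFloatDoubleVertexInputFormat's format should do. That format
> expects one JSON array per line, of the form
> [source_id, source_value, [[dest_id, edge_value], ...]], e.g. the five-vertex
> tiny graph from the Giraph quickstart:)
> {code}
> [0,0,[[1,1],[3,3]]]
> [1,0,[[0,1],[2,2],[3,1]]]
> [2,0,[[1,2],[4,4]]]
> [3,0,[[0,3],[1,1],[4,4]]]
> [4,0,[[3,4],[2,4]]]
> {code}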
> The job technically failed, but I did get output from part file 1 (I expected
> values to be printed for all vertices 0 through 4).
> {code}
> 0 0.16682289373110673
> 4 0.17098446073203233
> 2 0.17098446073203233
> {code}
> I ran the exact same command again (with no changes to the environment except
> for deleting the /user/spry/PageRank HDFS directory) and got no part files. I
> ran it one more time and got only the data from part file 2:
> {code}
> 1 0.24178880797750438
> 3 0.24178880797750438
> {code}
> I tried a few more times, but I haven't yet managed to see both part files in
> the output directory from a single run (across runs, the two part files
> together cover all five vertices; each failing run commits at most one of
> them).
> In the logs, I see hopeful things like this:
> {code}
> 14/01/22 09:47:48 INFO master.MasterThread: setup: Took 3.144 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: input superstep: Took 2.582
> seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 0: Took 0.827 seconds.
> ...
> 14/01/22 09:47:48 INFO master.MasterThread: superstep 30: Took 0.56 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: shutdown: Took 2.591 seconds.
> 14/01/22 09:47:48 INFO master.MasterThread: total: Took 30.18 seconds.
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master is ready to commit final
> job output data.
> {code}
> and like this:
> {code}
> 14/01/22 09:47:48 INFO yarn.GiraphYarnTask: Master has committed the final
> job output data.
> 14/01/22 09:47:48 DEBUG ipc.Client: Stopping client
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to
> hadoop2.j7.master/127.0.0.1:8020 from yarn: closed
> 14/01/22 09:47:48 DEBUG ipc.Client: IPC Client (660189515) connection to
> hadoop2.j7.master/127.0.0.1:8020 from yarn: stopped, remaining connections 0
> {code}
> Really, only one of the containers even fails, and it fails with a
> DataStreamer/LeaseExpiredException saying that the part file no longer
> exists. This log is from the run where part file 2 was not written out:
> {code}
> 14/01/22 09:47:48 WARN hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /user/spry/PageRank/_temporary/1/_temporary/attempt_1389643303411_0029_m_000002_1/part-m-00002:
> File does not exist. Holder DFSClient_NONMAPREDUCE_1153765281_1 does not
> have any open files.
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
> ...
> 14/01/22 09:47:48 ERROR worker.BspServiceWorker: unregisterHealth: Got
> failure, unregistering health on
> /_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
> on superstep 30
> 14/01/22 09:47:48 DEBUG zookeeper.ClientCnxn: Reading reply
> sessionid:0x1438d139efc0039, packet:: clientPath:null serverPath:null
> finished:false header:: 589,2 replyHeader:: 589,13968,-101 request::
> '/_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2,-1
> response:: null
> 14/01/22 09:47:48 ERROR graph.GraphTaskManager: run: Worker failure failed on
> another RuntimeException, original expection will be rethrown
> java.lang.IllegalStateException: unregisterHealth: KeeperException - Couldn't
> delete
> /_hadoopBsp/giraph_yarn_application_1389643303411_0029/_applicationAttemptsDir/0/_superstepDir/30/_workerHealthyDir/localhost_2
> at
> org.apache.giraph.worker.BspServiceWorker.unregisterHealth(BspServiceWorker.java:656)
> {code}
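> To illustrate the failure shape: the sketch below (standalone code, not
> Giraph's, and it assumes a reachable HDFS, since the local filesystem won't
> reproduce lease handling) shows the same LeaseExpiredException pattern
> whenever one client is still writing a file under _temporary while another
> client recursively deletes that tree, which is what a cleanup step racing
> with a task commit would look like:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> /**
>  * Standalone sketch (not Giraph code) of the suspected cleanup race.
>  * Run against a real HDFS (fs.defaultFS must point at a NameNode).
>  */
> public class CleanupRaceSketch {
>   public static void main(String[] args) throws Exception {
>     final Configuration conf = new Configuration();
>     final FileSystem fs = FileSystem.get(conf);
>     final Path tmpDir = new Path("/tmp/race-demo/_temporary/attempt_0");
>     final Path partFile = new Path(tmpDir, "part-m-00002");
>
>     // "Task" thread: writes its part file, taking its time, as a worker would.
>     Thread writer = new Thread(new Runnable() {
>       public void run() {
>         try {
>           FSDataOutputStream out = fs.create(partFile, true);
>           out.writeBytes("1\t0.24178880797750438\n");
>           Thread.sleep(3000);   // still holding the HDFS lease
>           out.writeBytes("3\t0.24178880797750438\n");
>           out.close();          // fails: the file was deleted underneath us
>         } catch (Exception e) {
>           e.printStackTrace();  // LeaseExpiredException / "File does not exist"
>         }
>       }
>     });
>
>     // "Cleanup" thread: removes the _temporary tree while the task is mid-write.
>     Thread cleanup = new Thread(new Runnable() {
>       public void run() {
>         try {
>           Thread.sleep(1000);
>           fs.delete(new Path("/tmp/race-demo/_temporary"), true);
>         } catch (Exception e) {
>           e.printStackTrace();
>         }
>       }
>     });
>
>     writer.start();
>     cleanup.start();
>     writer.join();
>     cleanup.join();
>     fs.close();
>   }
> }
> {code}
> Whichever writer loses that race is the one whose part file disappears, which
> would match seeing only part-m-00001 on some runs and only part-m-00002 on
> others.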
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)