Rob Vesse created GIRAPH-809:
--------------------------------
Summary: Worker Failure causes ArrayIndexOutOfBounds on BspServiceMaster
Key: GIRAPH-809
URL: https://issues.apache.org/jira/browse/GIRAPH-809
Project: Giraph
Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Rob Vesse
If a worker fails for any reason (e.g. an out-of-memory error), the
BspServiceMaster attempts to recover from a checkpoint. However, this code does
not guard against checkpointing being disabled, which is Giraph's default
behaviour, so recovery fails with the following ArrayIndexOutOfBoundsException:
{noformat}
2013-12-03 10:33:10,844 INFO org.apache.giraph.comm.netty.NettyClient:
connectAllAddresses: Successfully added 0 connections, (0 total connected) 0
failed, 0 failures total.
2013-12-03 10:33:10,844 INFO org.apache.giraph.partition.PartitionBalancer:
balancePartitionsAcrossWorkers: Using algorithm static
2013-12-03 10:33:10,844 INFO org.apache.giraph.partition.PartitionUtils:
analyzePartitionStats: Vertices - Mean: 333, Min:
Worker(hostname=mbp-rvesse.home, MRtaskID=1, port=30001) - 333, Max:
Worker(hostname=mbp-rvesse.home, MRtaskID=2, port=30002) - 334
2013-12-03 10:33:10,844 INFO org.apache.giraph.partition.PartitionUtils:
analyzePartitionStats: Edges - Mean: 50000, Min:
Worker(hostname=mbp-rvesse.home, MRtaskID=1, port=30001) - 49950, Max:
Worker(hostname=mbp-rvesse.home, MRtaskID=2, port=30002) - 50100
2013-12-03 10:33:10,850 INFO org.apache.giraph.master.BspServiceMaster:
barrierOnWorkerList: 0 out of 3 workers finished on superstep 2 on path
/_hadoopBsp/job_201312031028_0001/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
2013-12-03 10:33:10,850 INFO org.apache.giraph.master.BspServiceMaster:
barrierOnWorkerList: Waiting on [mbp-rvesse.home_2, mbp-rvesse.home_3,
mbp-rvesse.home_1]
2013-12-03 10:33:30,148 ERROR org.apache.giraph.master.BspServiceMaster:
superstepChosenWorkerAlive: Missing chosen worker
Worker(hostname=mbp-rvesse.home, MRtaskID=2, port=30002) on superstep 2
2013-12-03 10:33:30,148 INFO org.apache.giraph.master.MasterThread:
masterThread: Coordination of superstep 2 took 19.31 seconds ended with state
WORKER_FAILURE and is now on superstep 2
2013-12-03 10:33:30,156 ERROR org.apache.giraph.master.MasterThread:
masterThread: Master algorithm failed with ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1274)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:139)
2013-12-03 10:33:30,157 FATAL org.apache.giraph.graph.GraphMapper:
uncaughtException: OverrideExceptionHandler on thread
org.apache.giraph.master.MasterThread, msg =
java.lang.ArrayIndexOutOfBoundsException: -1, exiting...
java.lang.IllegalStateException: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.giraph.master.MasterThread.run(MasterThread.java:185)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1274)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:139)
{noformat}
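It appears that the code in BspServiceMaster does not check whether the
checkpoints array is empty and attempts to access the most recent checkpoint
regardless. With no checkpoints present, the index of the last element would be
-1, which would explain the index reported in the exception.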
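
To illustrate the kind of guard that seems to be missing, here is a minimal
sketch. It is not the actual Giraph code: the class CheckpointGuard, the method
lastGoodCheckpoint and the NO_CHECKPOINT sentinel are hypothetical names, and
the input is assumed to be an array of superstep numbers for which a finalized
checkpoint exists.

```java
/** Hypothetical helper for illustration only; not part of Giraph. */
final class CheckpointGuard {
    /** Assumed sentinel meaning "no checkpoint is available". */
    static final long NO_CHECKPOINT = -1L;

    /**
     * Returns the highest checkpointed superstep, or NO_CHECKPOINT when the
     * array is null or empty (e.g. because checkpointing is disabled).
     */
    static long lastGoodCheckpoint(long[] checkpointedSupersteps) {
        // Guard first: indexing [length - 1] on an empty array would
        // throw ArrayIndexOutOfBoundsException: -1.
        if (checkpointedSupersteps == null || checkpointedSupersteps.length == 0) {
            return NO_CHECKPOINT;
        }
        // Scan for the maximum so the result does not depend on input order.
        long latest = checkpointedSupersteps[0];
        for (long superstep : checkpointedSupersteps) {
            if (superstep > latest) {
                latest = superstep;
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        System.out.println(lastGoodCheckpoint(new long[0]));          // prints -1
        System.out.println(lastGoodCheckpoint(new long[] {4, 2, 6})); // prints 6
    }
}
```

With a guard like this, the caller could detect the sentinel and fail the job
with a clear "no checkpoint available" message instead of an
ArrayIndexOutOfBoundsException.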