[ https://issues.apache.org/jira/browse/GIRAPH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Vesse updated GIRAPH-809:
-----------------------------

    Attachment: GIRAPH-809.patch

> Worker Failure causes ArrayIndexOutOfBounds on BspServiceMaster
> ---------------------------------------------------------------
>
>                 Key: GIRAPH-809
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-809
>             Project: Giraph
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Rob Vesse
>         Attachments: GIRAPH-809.patch
>
>
> If a worker fails for any reason (e.g. an out-of-memory error), the
> BspServiceMaster attempts to recover from a checkpoint. However,
> checkpointing is disabled by default in Giraph, and the recovery code does
> not guard against that case, which results in the following
> ArrayIndexOutOfBoundsException:
> {noformat}
> 2013-12-03 10:33:10,844 INFO org.apache.giraph.comm.netty.NettyClient: connectAllAddresses: Successfully added 0 connections, (0 total connected) 0 failed, 0 failures total.
> 2013-12-03 10:33:10,844 INFO org.apache.giraph.partition.PartitionBalancer: balancePartitionsAcrossWorkers: Using algorithm static
> 2013-12-03 10:33:10,844 INFO org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Vertices - Mean: 333, Min: Worker(hostname=mbp-rvesse.home, MRtaskID=1, port=30001) - 333, Max: Worker(hostname=mbp-rvesse.home, MRtaskID=2, port=30002) - 334
> 2013-12-03 10:33:10,844 INFO org.apache.giraph.partition.PartitionUtils: analyzePartitionStats: Edges - Mean: 50000, Min: Worker(hostname=mbp-rvesse.home, MRtaskID=1, port=30001) - 49950, Max: Worker(hostname=mbp-rvesse.home, MRtaskID=2, port=30002) - 50100
> 2013-12-03 10:33:10,850 INFO org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: 0 out of 3 workers finished on superstep 2 on path /_hadoopBsp/job_201312031028_0001/_applicationAttemptsDir/0/_superstepDir/2/_workerFinishedDir
> 2013-12-03 10:33:10,850 INFO org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: Waiting on [mbp-rvesse.home_2, mbp-rvesse.home_3, mbp-rvesse.home_1]
> 2013-12-03 10:33:30,148 ERROR org.apache.giraph.master.BspServiceMaster: superstepChosenWorkerAlive: Missing chosen worker Worker(hostname=mbp-rvesse.home, MRtaskID=2, port=30002) on superstep 2
> 2013-12-03 10:33:30,148 INFO org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 2 took 19.31 seconds ended with state WORKER_FAILURE and is now on superstep 2
> 2013-12-03 10:33:30,156 ERROR org.apache.giraph.master.MasterThread: masterThread: Master algorithm failed with ArrayIndexOutOfBoundsException
> java.lang.ArrayIndexOutOfBoundsException: -1
>       at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1274)
>       at org.apache.giraph.master.MasterThread.run(MasterThread.java:139)
> 2013-12-03 10:33:30,157 FATAL org.apache.giraph.graph.GraphMapper: uncaughtException: OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, msg = java.lang.ArrayIndexOutOfBoundsException: -1, exiting...
> java.lang.IllegalStateException: java.lang.ArrayIndexOutOfBoundsException: -1
>       at org.apache.giraph.master.MasterThread.run(MasterThread.java:185)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>       at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1274)
>       at org.apache.giraph.master.MasterThread.run(MasterThread.java:139)
> {noformat}
> It appears that the code in BspServiceMaster does not check whether the
> checkpoints array is empty before accessing the most recent checkpoint.
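
For illustration only, the kind of guard the description calls for might look like the sketch below. It is not the Giraph source and not the contents of GIRAPH-809.patch: the class CheckpointGuard, the method lastGoodCheckpoint, and the NO_CHECKPOINT sentinel are hypothetical names, and the sketch assumes checkpoints are identified by superstep number. The "-1" in the exception message is consistent with an unguarded read of element length - 1 on an empty array.

```java
import java.util.Arrays;

/** Hypothetical helper; not the actual Giraph code or the attached patch. */
final class CheckpointGuard {
    /** Sentinel for "no checkpoint available" (an assumption of this sketch). */
    static final long NO_CHECKPOINT = -1L;

    private CheckpointGuard() { }

    /**
     * Returns the highest superstep number in checkpoints, or
     * NO_CHECKPOINT when the array is null or empty.
     */
    static long lastGoodCheckpoint(long[] checkpoints) {
        if (checkpoints == null || checkpoints.length == 0) {
            // Checkpointing disabled or nothing written yet. Without this
            // check, checkpoints[length - 1] would be checkpoints[-1].
            return NO_CHECKPOINT;
        }
        // Sort a copy so the caller's array is left untouched.
        long[] sorted = checkpoints.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length - 1];
    }

    public static void main(String[] args) {
        System.out.println(lastGoodCheckpoint(new long[0]));             // prints -1
        System.out.println(lastGoodCheckpoint(new long[] {4L, 2L, 6L})); // prints 6
    }
}
```

With a guard of this shape, the caller can distinguish "no checkpoint to restart from" from a real superstep number and fail the job with a descriptive message instead of an ArrayIndexOutOfBoundsException.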



--
This message was sent by Atlassian JIRA
(v6.1#6144)
