[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail

Owen O'Malley (Commented) (JIRA) Tue, 15 Nov 2011 10:21:17 -0800

    [ 
https://issues.apache.org/jira/browse/GIRAPH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150652#comment-13150652
 ]


Owen O'Malley commented on GIRAPH-61:
-------------------------------------

I don't think this is the right direction. Adding a second job is really 
expensive to get working right. (And worse, it will make us more sensitive to 
security vs. non-security.) It seems like a much better fix is to be able to 
use the initial state as a checkpoint.

It would also be good if we could handle worker failure more gracefully, but 
that is a much bigger project.
                
> Worker's early failure will cause the whole system fail
> -------------------------------------------------------
>
>                 Key: GIRAPH-61
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-61
>             Project: Giraph
>          Issue Type: Bug
>          Components: bsp
>    Affects Versions: 0.70.0
>            Reporter: Zhiwei Gu
>            Priority: Critical
>
> When there's early failure happens to a worker, the whole system will fail.
> Observed failed worker:
>    State: Creating RPC threads failed
>    Result: It will cause the worker fail, however, master has already 
> recorded and reserved these splits to this worker (identified by 
> InetAddress), thus although hadoop reschedule a mapper for this worker, the 
> master still waiting for the old worker's response, finally, the master will 
> fail.
> [Failed worker logs:]
> 2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: 
> vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
> 2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: 
> startSuperstep: Ready for computation on superstep 1 since worker selection 
> and vertex range assignments are done in 
> /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
> 2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: 
> getAggregatorValues: no aggregators in 
> /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir
>  on superstep 1
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: 
> totalMem=84213760 maxMem=2067988480 freeMem=65069808
> 2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: 
> flush: starting...
> 2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=102400 and 
> reduceRetainSize=102400
> 2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running 
> child : java.lang.OutOfMemoryError: unable to create new native thread
>       at java.lang.Thread.start0(Native Method)
>       at java.lang.Thread.start(Thread.java:597)
>       at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
>       at java.lang.ProcessImpl.start(ProcessImpl.java:65)
>       at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
>       at org.apache.hadoop.util.Shell.run(Shell.java:182)
>       at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>       at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>       at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>       at 
> org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
>       at 
> org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
>       at 
> org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
>       at 
> org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
>       at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
>       at 
> org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
>       at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:396)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> [Failed master logs:]
> 2011-10-24 18:21:47,611 INFO org.apache.giraph.graph.BspServiceMaster: 
> checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  
> Sleeping for 5000 msecs and used 1 of 60 attempts.
> 2011-10-24 18:21:47,615 INFO org.apache.giraph.graph.BspServiceMaster: 
> checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:52,629 INFO org.apache.giraph.graph.BspServiceMaster: 
> checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  
> Sleeping for 5000 msecs and used 2 of 60 attempts.
> 2011-10-24 18:21:52,631 INFO org.apache.giraph.graph.BspServiceMaster: 
> checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:57,636 INFO org.apache.giraph.graph.BspServiceMaster: 
> checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  
> Sleeping for 5000 msecs and used 3 of 60 attempts.
> 2011-10-24 18:21:57,637 INFO org.apache.giraph.graph.BspServiceMaster: 
> checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:21:58,142 ERROR org.apache.zookeeper.ClientCnxn: Error while 
> calling watcher 
> java.lang.RuntimeException: 
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for 
> /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
>       at 
> org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:890)
>       at 
> org.apache.giraph.graph.BspServiceMaster.checkHealthyWorkerFailure(BspServiceMaster.java:1964)
>       at 
> org.apache.giraph.graph.BspServiceMaster.processEvent(BspServiceMaster.java:1995)
>       at org.apache.giraph.graph.BspService.process(BspService.java:1100)
>       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:488)
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for 
> /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/1/_superstepDir/0/_vertexRangeAssignments
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>       at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921)
>       at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950)
>       at 
> org.apache.giraph.graph.BspService.getVertexRangeMap(BspService.java:861)
>       ... 4 more
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: 
> checkWorkers: Only found 399 responses of 400 needed to start superstep 0.  
> Sleeping for 5000 msecs and used 59 of 60 attempts.
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: 
> checkWorkers: No response from partition 279 (could be master)
> 2011-10-24 18:26:38,501 ERROR org.apache.giraph.graph.BspServiceMaster: 
> checkWorkers: Did not receive enough processes in time (only 399 of 400 
> required).  This occurs if you do not have enough map tasks available 
> simultaneously on your Hadoop instance to fulfill the number of requested 
> workers.
> 2011-10-24 18:26:38,501 FATAL org.apache.giraph.graph.BspServiceMaster: 
> coordinateSuperstep: Not enough healthy workers for superstep 0
> 2011-10-24 18:26:38,501 INFO org.apache.giraph.graph.BspServiceMaster: 
> setJobState: 
> {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on 
> superstep 0
> 2011-10-24 18:26:38,507 FATAL org.apache.giraph.graph.BspServiceMaster: 
> failJob: Killing job job_201108260911_842943
> 2011-10-24 18:26:38,567 ERROR org.apache.giraph.graph.MasterThread: 
> masterThread: Master algorithm failed: 
> java.lang.NullPointerException
>       at 
> org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
>       at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,567 FATAL org.apache.giraph.graph.GraphMapper: 
> uncaughtException: OverrideExceptionHandler on thread 
> org.apache.giraph.graph.MasterThread, msg = java.lang.NullPointerException, 
> exiting...
> java.lang.RuntimeException: java.lang.NullPointerException
>       at org.apache.giraph.graph.MasterThread.run(MasterThread.java:179)
> Caused by: java.lang.NullPointerException
>       at 
> org.apache.giraph.graph.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1613)
>       at org.apache.giraph.graph.MasterThread.run(MasterThread.java:105)
> 2011-10-24 18:26:38,568 WARN org.apache.giraph.zk.ZooKeeperManager: 
> onlineZooKeeperServers: Forced a shutdown hook kill of the ZooKeeper process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail

Reply via email to