It looks like the unregisterHealth() method is being called on the workers: https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/worker/BspServiceWorker.java#L651
The workers all have a similar log to this: 2014-01-27 21:58:52,752 ERROR org.apache.giraph.worker.BspServiceWorker (main): unregisterHealth: Got failure, unregistering health on /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1 on superstep 1 2014-01-27 21:58:52,758 INFO org.apache.hadoop.mapred.Task (main): Task:attempt_201401272153_0001_m_000001_0 is done. And is in the process of commiting 2014-01-27 21:58:54,846 INFO org.apache.hadoop.mapred.Task (main): Task 'attempt_201401272153_0001_m_000001_0' done. On Wed, Jan 29, 2014 at 12:39 PM, Sam Garrett <[email protected]> wrote: > From the hadoop docs it looks like preemption is turned off by default: > https://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html > > I have not changed that setting on the EMR clusters I've been testing on. > > I was able to get a toy version of this program working on EMR. The > version I was testing with before was running against 720mb of gzipped data. > > > This is kind of strange. It appears that you lost your workers after > superstep 0. Is preemption turned on for your cluster? > > > On Mon, Jan 27, 2014 at 5:18 PM, Sam Garrett <[email protected]> wrote: > >> Hello. >> >> I've been working on a modified version of the clustered components >> example code. I've been running Giraph in Amazon EMR successfully with >> other versions of this code, however with this iteration I keep getting >> this error on the master: >> >> 2014-01-27 21:58:45,756 INFO org.apache.giraph.master.MasterThread >> (org.apache.giraph.master.MasterThread): masterThread: Coordination of >> superstep 0 took 76.184 seconds ended with state THIS_SUPERSTEP_DONE and is >> now on superstep 1 >> 2014-01-27 21:59:17,811 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 567946 more msecs left before giving up. >> 2014-01-27 21:59:47,817 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 537940 more msecs left before giving up. >> 2014-01-27 22:00:17,822 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 507935 more msecs left before giving up. >> 2014-01-27 22:00:47,828 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 477929 more msecs left before giving up. >> 2014-01-27 22:01:17,834 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 447923 more msecs left before giving up. >> 2014-01-27 22:01:47,840 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 417917 more msecs left before giving up. >> 2014-01-27 22:02:17,848 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 387909 more msecs left before giving up. >> 2014-01-27 22:02:47,853 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 357904 more msecs left before giving up. >> 2014-01-27 22:03:17,858 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 327899 more msecs left before giving up. >> 2014-01-27 22:03:47,863 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 297894 more msecs left before giving up. >> 2014-01-27 22:04:17,870 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 267887 more msecs left before giving up. >> 2014-01-27 22:04:47,876 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 237881 more msecs left before giving up. >> 2014-01-27 22:05:17,881 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 207876 more msecs left before giving up. >> 2014-01-27 22:05:47,887 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 177870 more msecs left before giving up. >> 2014-01-27 22:06:17,892 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 147865 more msecs left before giving up. >> 2014-01-27 22:06:47,897 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 117860 more msecs left before giving up. >> 2014-01-27 22:07:17,930 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 87827 more msecs left before giving up. >> 2014-01-27 22:07:47,936 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 57821 more msecs left before giving up. >> 2014-01-27 22:08:17,941 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> 27816 more msecs left before giving up. >> 2014-01-27 22:08:47,946 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 >> responses of 15 needed to start superstep 1. Reporting every 30000 msecs, >> -2189 more msecs left before giving up. >> 2014-01-27 22:08:47,946 ERROR org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): checkWorkers: Did not receive >> enough processes in time (only 0 of 15 required) after waiting >> 600000msecs). This occurs if you do not have enough map tasks available >> simultaneously on your Hadoop instance to fulfill the number of requested >> workers. >> 2014-01-27 22:08:47,948 INFO org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): setJobState: >> {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on >> superstep 1 >> 2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): failJob: Killing job >> job_201401272153_0001 >> 2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster >> (org.apache.giraph.master.MasterThread): failJob: exception >> java.lang.IllegalStateException: coordinateSuperstep: Not enough healthy >> workers for superstep 1 >> 2014-01-27 22:08:48,022 ERROR org.apache.giraph.master.MasterThread >> (org.apache.giraph.master.MasterThread): masterThread: Master algorithm >> failed with NullPointerException >> java.lang.NullPointerException >> at >> org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354) >> at >> org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72) >> at >> org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563) >> at org.apache.giraph.master.MasterThread.run(MasterThread.java:115) >> 2014-01-27 22:08:48,023 FATAL org.apache.giraph.graph.GraphMapper >> (org.apache.giraph.master.MasterThread): uncaughtException: >> OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, >> msg = java.lang.NullPointerException, exiting... >> java.lang.IllegalStateException: java.lang.NullPointerException >> at org.apache.giraph.master.MasterThread.run(MasterThread.java:185) >> Caused by: java.lang.NullPointerException >> at >> org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354) >> at >> org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72) >> at >> org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563) >> at org.apache.giraph.master.MasterThread.run(MasterThread.java:115) >> 2014-01-27 22:08:48,025 INFO org.apache.giraph.zk.ZooKeeperManager >> (Thread-16): run: Shutdown hook started. >> 2014-01-27 22:08:48,025 WARN org.apache.giraph.zk.ZooKeeperManager >> (Thread-16): onlineZooKeeperServers: Forced a shutdown hook kill of the >> ZooKeeper process. >> >> All of the worker nodes succesfully complete step 0. >> >> I tried adding more mappers (machines) to the cluster and that doesn't >> seem to prevent the error. I've been using a forked version of the latest >> trunk code on the apache giraph github repo. I'm not sure why the workers >> would be able to communicate properly for step 0 and then fail to do so >> when starting step 1. >> >> My current cluster configuration is 1 x m2.4xlarge for the master node >> and then 15 x m1.large for the core nodes. I'm also using >> giraph.useOutOfCoreMessages=true. >> >> For reference here is a log from one of the workers: >> >> 2014-01-27 21:57:29,532 INFO >> org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main): >> newStore: Created class >> org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex >> id class org.apache.hadoop.io.Text and message value class >> org.apache.hadoop.io.Text and no combiner >> 2014-01-27 21:57:29,532 INFO >> org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main): >> newStore: Created class >> org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex >> id class org.apache.hadoop.io.Text and message value class >> org.apache.hadoop.io.Text and no combiner >> 2014-01-27 21:57:29,550 WARN org.apache.giraph.bsp.BspService >> (main-EventThread): process: Unknown and unprocessed event >> (path=/_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir, >> type=NodeChildrenChanged, state=SyncConnected) >> 2014-01-27 21:57:29,654 INFO org.apache.giraph.worker.BspServiceWorker >> (main): registerHealth: Created my health node for attempt=0, superstep=0 >> with >> /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1 >> and workerInfo= Worker(hostname=ip-10-9-140-146.ec2.internal, MRtaskID=1, >> port=30001) >> 2014-01-27 21:57:29,696 INFO org.apache.giraph.bsp.BspService >> (main-EventThread): process: partitionAssignmentsReadyChanged (partitions >> are assigned) >> 2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker >> (main): startSuperstep: Master(hostname=ip-10-238-165-32.ec2.internal, >> MRtaskID=0, port=30000) >> 2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker >> (main): startSuperstep: Ready for computation on superstep 0 since worker >> selection and vertex range assignments are done in >> /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_addressesAndPartitions >> 2014-01-27 21:57:29,715 INFO org.apache.giraph.comm.netty.NettyClient >> (main): connectAllAddresses: Successfully added 0 connections, (0 total >> connected) 0 failed, 0 failures total. >> 2014-01-27 21:57:29,719 INFO org.apache.giraph.comm.netty.NettyClient >> (main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0001, >> MBytes/sec received = 0, MBytesSent = 0, MBytesReceived = 0, ave sent req >> MBytes = 0, ave received req MBytes = 0, secs waited = 0.31 >> 2014-01-27 21:57:29,726 INFO org.apache.giraph.worker.BspServiceWorker >> (main): sendWorkerPartitions: Done sending all my partitions. >> 2014-01-27 21:57:29,738 INFO org.apache.giraph.worker.BspServiceWorker >> (main): exchangeVertexPartitions: Done with exchange. >> 2014-01-27 21:57:29,745 INFO >> org.apache.giraph.utils.TaskIdsPermitsBarrier (main): >> waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator >> data, task ids: [4] >> 2014-01-27 21:57:29,747 INFO org.apache.giraph.graph.GraphTaskManager >> (main): execute: 15 partitions to process with 1 compute thread(s), >> originally 1 thread(s) on superstep 0 >> 2014-01-27 21:57:31,649 INFO >> org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-3): >> decode: Server window metrics MBytes/sec sent = 0.0014, MBytes/sec received >> = 1.3229, MBytesSent = 0.042, MBytesReceived = 39.6883, ave sent req MBytes >> = 0, ave received req MBytes = 0.0139, secs waited = 30.0 >> 2014-01-27 21:58:04,507 INFO >> org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-7): >> decode: Server window metrics MBytes/sec sent = 0.0348, MBytes/sec received >> = 7.4575, MBytesSent = 1.1421, MBytesReceived = 245.0474, ave sent req >> MBytes = 0, ave received req MBytes = 0.0137, secs waited = 32.858 >> 2014-01-27 21:58:05,725 INFO org.apache.giraph.graph.ComputeCallable >> (compute-0): call: Completed 6 partitions, 9 remaining Memory >> (free/total/max) = 766.17M / 1939.00M / 3641.00M >> 2014-01-27 21:58:32,284 INFO org.apache.giraph.utils.ProgressableUtils >> (main): waitFor: Future result not ready yet >> java.util.concurrent.FutureTask@7c2bc94a >> 2014-01-27 21:58:32,292 INFO org.apache.giraph.utils.ProgressableUtils >> (main): waitFor: Waiting for >> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@3ae3f711 >> 2014-01-27 21:58:34,508 INFO >> org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-1): >> decode: Server window metrics MBytes/sec sent = 0.0325, MBytes/sec received >> = 7.2723, MBytesSent = 0.9743, MBytesReceived = 218.1846, ave sent req >> MBytes = 0, ave received req MBytes = 0.0191, secs waited = 30.001 >> 2014-01-27 21:58:36,622 INFO org.apache.giraph.graph.ComputeCallable >> (compute-0): call: Completed 14 partitions, 1 remaining Memory >> (free/total/max) = 872.90M / 2989.00M / 3641.00M >> 2014-01-27 21:58:38,073 INFO org.apache.giraph.graph.ComputeCallable >> (compute-0): call: Computation took 68.32348 secs for 15 partitions on >> superstep 0. Flushing started >> 2014-01-27 21:58:38,095 INFO org.apache.giraph.worker.BspServiceWorker >> (main): finishSuperstep: Waiting on all requests, superstep 0 Memory >> (free/total/max) = 716.70M / 2989.00M / 3641.00M >> 2014-01-27 21:58:38,095 INFO org.apache.giraph.comm.netty.NettyClient >> (main): logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 5568 >> open requests, waiting for it to be <= 0, MBytes/sec sent = 7.6591, >> MBytes/sec received = 0.0349, MBytesSent = 466.4169, MBytesReceived = >> 2.1229, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs >> waited = 60.896 >> 2014-01-27 21:58:38,101 INFO org.apache.giraph.comm.netty.NettyClient >> (main): logInfoAboutOpenRequests: 2808 requests for taskId=8, 2641 requests >> for taskId=11, 104 requests for taskId=3, 1 requests for taskId=5, 1 >> requests for taskId=14, >> 2014-01-27 21:58:43,728 INFO org.apache.giraph.comm.netty.NettyClient >> (main): waitAllRequests: Finished all requests. MBytes/sec sent = 7.0106, >> MBytes/sec received = 0.0329, MBytesSent = 466.4169, MBytesReceived = >> 2.1919, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs >> waited = 66.529 >> 2014-01-27 21:58:43,728 INFO >> org.apache.giraph.worker.WorkerAggregatorHandler (main): finishSuperstep: >> Start gathering aggregators, workers will send their aggregated values once >> they are done with superstep computation >> 2014-01-27 21:58:43,768 INFO >> org.apache.giraph.utils.TaskIdsPermitsBarrier (main): >> waitForRequiredPermits: Waiting for 3 more tasks to send their aggregator >> data, task ids: [11, 13, 14] >> 2014-01-27 21:58:45,673 INFO org.apache.giraph.comm.netty.NettyClient >> (main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0048, >> MBytes/sec received = 0.0025, MBytesSent = 0, MBytesReceived = 0, ave sent >> req MBytes = 0, ave received req MBytes = 0, secs waited = 0.005 >> 2014-01-27 21:58:45,673 INFO org.apache.giraph.worker.BspServiceWorker >> (main): finishSuperstep: Superstep 0, messages = 0 , message bytes = 0 , >> Memory (free/total/max) = 1093.58M / 2989.00M / 3641.00M >> 2014-01-27 21:58:45,682 INFO org.apache.giraph.worker.BspServiceWorker >> (main): finishSuperstep: (waiting for rest of workers) WORKER_ONLY - >> Attempt=0, Superstep=0 >> 2014-01-27 21:58:45,721 INFO org.apache.giraph.bsp.BspService >> (main-EventThread): process: superstepFinished signaled >> 2014-01-27 21:58:45,744 INFO org.apache.giraph.worker.BspServiceWorker >> (main): finishSuperstep: Completed superstep 0 with global stats >> (vtx=364766,finVtx=0,edges=3815107,msgCount=0,msgBytesCount=0,haltComputation=false) >> and classes >> (computation=org.apache.giraph.examples.ReflexiveConnectedStringComponentsComputation,combiner=null) >> 2014-01-27 21:58:45,744 INFO >> org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main): >> newStore: Created class >> org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex >> id class org.apache.hadoop.io.Text and message value class >> org.apache.hadoop.io.Text and no combiner >> 2014-01-27 21:58:52,752 ERROR org.apache.giraph.worker.BspServiceWorker >> (main): unregisterHealth: Got failure, unregistering health on >> /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1 >> on superstep 1 >> 2014-01-27 21:58:52,758 INFO org.apache.hadoop.mapred.Task (main): >> Task:attempt_201401272153_0001_m_000001_0 is done. And is in the process of >> commiting >> 2014-01-27 21:58:54,846 INFO org.apache.hadoop.mapred.Task (main): Task >> 'attempt_201401272153_0001_m_000001_0' done. >> >> Thank you for your time and help. >> >> -- >> Sam >> > > > > -- > Sam Garrett > ActionX, NYC, 260 820 0394 > -- Sam Garrett ActionX
