>From the hadoop docs it looks like preemption is turned off by default: https://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html
I have not changed that setting on the EMR clusters I've been testing on. I was able to get a toy version of this program working on EMR. The version I was testing with before was running against 720mb of gzipped data. > This is kind of strange. It appears that you lost your workers after superstep 0. Is preemption turned on for your cluster? On Mon, Jan 27, 2014 at 5:18 PM, Sam Garrett <[email protected]> wrote: > Hello. > > I've been working on a modified version of the clustered components > example code. I've been running Giraph in Amazon EMR successfully with > other versions of this code, however with this iteration I keep getting > this error on the master: > > 2014-01-27 21:58:45,756 INFO org.apache.giraph.master.MasterThread > (org.apache.giraph.master.MasterThread): masterThread: Coordination of > superstep 0 took 76.184 seconds ended with state THIS_SUPERSTEP_DONE and is > now on superstep 1 > 2014-01-27 21:59:17,811 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 567946 more msecs left before giving up. > 2014-01-27 21:59:47,817 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 537940 more msecs left before giving up. > 2014-01-27 22:00:17,822 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 507935 more msecs left before giving up. > 2014-01-27 22:00:47,828 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 477929 more msecs left before giving up. > 2014-01-27 22:01:17,834 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 447923 more msecs left before giving up. > 2014-01-27 22:01:47,840 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 417917 more msecs left before giving up. > 2014-01-27 22:02:17,848 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 387909 more msecs left before giving up. > 2014-01-27 22:02:47,853 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 357904 more msecs left before giving up. > 2014-01-27 22:03:17,858 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 327899 more msecs left before giving up. > 2014-01-27 22:03:47,863 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 297894 more msecs left before giving up. > 2014-01-27 22:04:17,870 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 267887 more msecs left before giving up. > 2014-01-27 22:04:47,876 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 237881 more msecs left before giving up. > 2014-01-27 22:05:17,881 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 207876 more msecs left before giving up. > 2014-01-27 22:05:47,887 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 177870 more msecs left before giving up. > 2014-01-27 22:06:17,892 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 147865 more msecs left before giving up. > 2014-01-27 22:06:47,897 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 117860 more msecs left before giving up. > 2014-01-27 22:07:17,930 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 87827 more msecs left before giving up. > 2014-01-27 22:07:47,936 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 57821 more msecs left before giving up. > 2014-01-27 22:08:17,941 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > 27816 more msecs left before giving up. > 2014-01-27 22:08:47,946 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0 > responses of 15 needed to start superstep 1. Reporting every 30000 msecs, > -2189 more msecs left before giving up. > 2014-01-27 22:08:47,946 ERROR org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): checkWorkers: Did not receive > enough processes in time (only 0 of 15 required) after waiting > 600000msecs). This occurs if you do not have enough map tasks available > simultaneously on your Hadoop instance to fulfill the number of requested > workers. > 2014-01-27 22:08:47,948 INFO org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): setJobState: > {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on > superstep 1 > 2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): failJob: Killing job > job_201401272153_0001 > 2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster > (org.apache.giraph.master.MasterThread): failJob: exception > java.lang.IllegalStateException: coordinateSuperstep: Not enough healthy > workers for superstep 1 > 2014-01-27 22:08:48,022 ERROR org.apache.giraph.master.MasterThread > (org.apache.giraph.master.MasterThread): masterThread: Master algorithm > failed with NullPointerException > java.lang.NullPointerException > at > org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354) > at > org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72) > at > org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563) > at org.apache.giraph.master.MasterThread.run(MasterThread.java:115) > 2014-01-27 22:08:48,023 FATAL org.apache.giraph.graph.GraphMapper > (org.apache.giraph.master.MasterThread): uncaughtException: > OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread, > msg = java.lang.NullPointerException, exiting... > java.lang.IllegalStateException: java.lang.NullPointerException > at org.apache.giraph.master.MasterThread.run(MasterThread.java:185) > Caused by: java.lang.NullPointerException > at > org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354) > at > org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72) > at > org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563) > at org.apache.giraph.master.MasterThread.run(MasterThread.java:115) > 2014-01-27 22:08:48,025 INFO org.apache.giraph.zk.ZooKeeperManager > (Thread-16): run: Shutdown hook started. > 2014-01-27 22:08:48,025 WARN org.apache.giraph.zk.ZooKeeperManager > (Thread-16): onlineZooKeeperServers: Forced a shutdown hook kill of the > ZooKeeper process. > > All of the worker nodes succesfully complete step 0. > > I tried adding more mappers (machines) to the cluster and that doesn't > seem to prevent the error. I've been using a forked version of the latest > trunk code on the apache giraph github repo. I'm not sure why the workers > would be able to communicate properly for step 0 and then fail to do so > when starting step 1. > > My current cluster configuration is 1 x m2.4xlarge for the master node and > then 15 x m1.large for the core nodes. I'm also using > giraph.useOutOfCoreMessages=true. > > For reference here is a log from one of the workers: > > 2014-01-27 21:57:29,532 INFO > org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main): > newStore: Created class > org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex > id class org.apache.hadoop.io.Text and message value class > org.apache.hadoop.io.Text and no combiner > 2014-01-27 21:57:29,532 INFO > org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main): > newStore: Created class > org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex > id class org.apache.hadoop.io.Text and message value class > org.apache.hadoop.io.Text and no combiner > 2014-01-27 21:57:29,550 WARN org.apache.giraph.bsp.BspService > (main-EventThread): process: Unknown and unprocessed event > (path=/_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir, > type=NodeChildrenChanged, state=SyncConnected) > 2014-01-27 21:57:29,654 INFO org.apache.giraph.worker.BspServiceWorker > (main): registerHealth: Created my health node for attempt=0, superstep=0 > with > /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1 > and workerInfo= Worker(hostname=ip-10-9-140-146.ec2.internal, MRtaskID=1, > port=30001) > 2014-01-27 21:57:29,696 INFO org.apache.giraph.bsp.BspService > (main-EventThread): process: partitionAssignmentsReadyChanged (partitions > are assigned) > 2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker > (main): startSuperstep: Master(hostname=ip-10-238-165-32.ec2.internal, > MRtaskID=0, port=30000) > 2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker > (main): startSuperstep: Ready for computation on superstep 0 since worker > selection and vertex range assignments are done in > /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_addressesAndPartitions > 2014-01-27 21:57:29,715 INFO org.apache.giraph.comm.netty.NettyClient > (main): connectAllAddresses: Successfully added 0 connections, (0 total > connected) 0 failed, 0 failures total. > 2014-01-27 21:57:29,719 INFO org.apache.giraph.comm.netty.NettyClient > (main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0001, > MBytes/sec received = 0, MBytesSent = 0, MBytesReceived = 0, ave sent req > MBytes = 0, ave received req MBytes = 0, secs waited = 0.31 > 2014-01-27 21:57:29,726 INFO org.apache.giraph.worker.BspServiceWorker > (main): sendWorkerPartitions: Done sending all my partitions. > 2014-01-27 21:57:29,738 INFO org.apache.giraph.worker.BspServiceWorker > (main): exchangeVertexPartitions: Done with exchange. > 2014-01-27 21:57:29,745 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier > (main): waitForRequiredPermits: Waiting for 1 more tasks to send their > aggregator data, task ids: [4] > 2014-01-27 21:57:29,747 INFO org.apache.giraph.graph.GraphTaskManager > (main): execute: 15 partitions to process with 1 compute thread(s), > originally 1 thread(s) on superstep 0 > 2014-01-27 21:57:31,649 INFO > org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-3): > decode: Server window metrics MBytes/sec sent = 0.0014, MBytes/sec received > = 1.3229, MBytesSent = 0.042, MBytesReceived = 39.6883, ave sent req MBytes > = 0, ave received req MBytes = 0.0139, secs waited = 30.0 > 2014-01-27 21:58:04,507 INFO > org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-7): > decode: Server window metrics MBytes/sec sent = 0.0348, MBytes/sec received > = 7.4575, MBytesSent = 1.1421, MBytesReceived = 245.0474, ave sent req > MBytes = 0, ave received req MBytes = 0.0137, secs waited = 32.858 > 2014-01-27 21:58:05,725 INFO org.apache.giraph.graph.ComputeCallable > (compute-0): call: Completed 6 partitions, 9 remaining Memory > (free/total/max) = 766.17M / 1939.00M / 3641.00M > 2014-01-27 21:58:32,284 INFO org.apache.giraph.utils.ProgressableUtils > (main): waitFor: Future result not ready yet > java.util.concurrent.FutureTask@7c2bc94a > 2014-01-27 21:58:32,292 INFO org.apache.giraph.utils.ProgressableUtils > (main): waitFor: Waiting for > org.apache.giraph.utils.ProgressableUtils$FutureWaitable@3ae3f711 > 2014-01-27 21:58:34,508 INFO > org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-1): > decode: Server window metrics MBytes/sec sent = 0.0325, MBytes/sec received > = 7.2723, MBytesSent = 0.9743, MBytesReceived = 218.1846, ave sent req > MBytes = 0, ave received req MBytes = 0.0191, secs waited = 30.001 > 2014-01-27 21:58:36,622 INFO org.apache.giraph.graph.ComputeCallable > (compute-0): call: Completed 14 partitions, 1 remaining Memory > (free/total/max) = 872.90M / 2989.00M / 3641.00M > 2014-01-27 21:58:38,073 INFO org.apache.giraph.graph.ComputeCallable > (compute-0): call: Computation took 68.32348 secs for 15 partitions on > superstep 0. Flushing started > 2014-01-27 21:58:38,095 INFO org.apache.giraph.worker.BspServiceWorker > (main): finishSuperstep: Waiting on all requests, superstep 0 Memory > (free/total/max) = 716.70M / 2989.00M / 3641.00M > 2014-01-27 21:58:38,095 INFO org.apache.giraph.comm.netty.NettyClient > (main): logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 5568 > open requests, waiting for it to be <= 0, MBytes/sec sent = 7.6591, > MBytes/sec received = 0.0349, MBytesSent = 466.4169, MBytesReceived = > 2.1229, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs > waited = 60.896 > 2014-01-27 21:58:38,101 INFO org.apache.giraph.comm.netty.NettyClient > (main): logInfoAboutOpenRequests: 2808 requests for taskId=8, 2641 requests > for taskId=11, 104 requests for taskId=3, 1 requests for taskId=5, 1 > requests for taskId=14, > 2014-01-27 21:58:43,728 INFO org.apache.giraph.comm.netty.NettyClient > (main): waitAllRequests: Finished all requests. MBytes/sec sent = 7.0106, > MBytes/sec received = 0.0329, MBytesSent = 466.4169, MBytesReceived = > 2.1919, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs > waited = 66.529 > 2014-01-27 21:58:43,728 INFO > org.apache.giraph.worker.WorkerAggregatorHandler (main): finishSuperstep: > Start gathering aggregators, workers will send their aggregated values once > they are done with superstep computation > 2014-01-27 21:58:43,768 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier > (main): waitForRequiredPermits: Waiting for 3 more tasks to send their > aggregator data, task ids: [11, 13, 14] > 2014-01-27 21:58:45,673 INFO org.apache.giraph.comm.netty.NettyClient > (main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0048, > MBytes/sec received = 0.0025, MBytesSent = 0, MBytesReceived = 0, ave sent > req MBytes = 0, ave received req MBytes = 0, secs waited = 0.005 > 2014-01-27 21:58:45,673 INFO org.apache.giraph.worker.BspServiceWorker > (main): finishSuperstep: Superstep 0, messages = 0 , message bytes = 0 , > Memory (free/total/max) = 1093.58M / 2989.00M / 3641.00M > 2014-01-27 21:58:45,682 INFO org.apache.giraph.worker.BspServiceWorker > (main): finishSuperstep: (waiting for rest of workers) WORKER_ONLY - > Attempt=0, Superstep=0 > 2014-01-27 21:58:45,721 INFO org.apache.giraph.bsp.BspService > (main-EventThread): process: superstepFinished signaled > 2014-01-27 21:58:45,744 INFO org.apache.giraph.worker.BspServiceWorker > (main): finishSuperstep: Completed superstep 0 with global stats > (vtx=364766,finVtx=0,edges=3815107,msgCount=0,msgBytesCount=0,haltComputation=false) > and classes > (computation=org.apache.giraph.examples.ReflexiveConnectedStringComponentsComputation,combiner=null) > 2014-01-27 21:58:45,744 INFO > org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main): > newStore: Created class > org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex > id class org.apache.hadoop.io.Text and message value class > org.apache.hadoop.io.Text and no combiner > 2014-01-27 21:58:52,752 ERROR org.apache.giraph.worker.BspServiceWorker > (main): unregisterHealth: Got failure, unregistering health on > /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1 > on superstep 1 > 2014-01-27 21:58:52,758 INFO org.apache.hadoop.mapred.Task (main): > Task:attempt_201401272153_0001_m_000001_0 is done. And is in the process of > commiting > 2014-01-27 21:58:54,846 INFO org.apache.hadoop.mapred.Task (main): Task > 'attempt_201401272153_0001_m_000001_0' done. > > Thank you for your time and help. > > -- > Sam > -- Sam Garrett ActionX, NYC, 260 820 0394
