Re: Superstep 0 finishes, but Superstep 1 doesn't start

Sam Garrett Wed, 29 Jan 2014 14:01:05 -0800

It looks like the unregisterHealth() method is being called on the workers:
https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/worker/BspServiceWorker.java#L651


The workers all have a similar log to this:

2014-01-27 21:58:52,752 ERROR org.apache.giraph.worker.BspServiceWorker
(main): unregisterHealth: Got failure, unregistering health on
/_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1
on superstep 1
2014-01-27 21:58:52,758 INFO org.apache.hadoop.mapred.Task (main):
Task:attempt_201401272153_0001_m_000001_0 is done. And is in the process of
commiting
2014-01-27 21:58:54,846 INFO org.apache.hadoop.mapred.Task (main): Task
'attempt_201401272153_0001_m_000001_0' done.


On Wed, Jan 29, 2014 at 12:39 PM, Sam Garrett <[email protected]> wrote:

> From the hadoop docs it looks like preemption is turned off by default:
> https://hadoop.apache.org/docs/r1.2.1/fair_scheduler.html
>
> I have not changed that setting on the EMR clusters I've been testing on.
>
> I was able to get a toy version of this program working on EMR. The
> version I was testing with before was running against 720mb of gzipped data.
>
> > This is kind of strange. It appears that you lost your workers after
> superstep 0. Is preemption turned on for your cluster?
>
>
> On Mon, Jan 27, 2014 at 5:18 PM, Sam Garrett <[email protected]> wrote:
>
>> Hello.
>>
>> I've been working on a modified version of the clustered components
>> example code. I've been running Giraph in Amazon EMR successfully with
>> other versions of this code, however with this iteration I keep getting
>> this error on the master:
>>
>> 2014-01-27 21:58:45,756 INFO org.apache.giraph.master.MasterThread
>> (org.apache.giraph.master.MasterThread): masterThread: Coordination of
>> superstep 0 took 76.184 seconds ended with state THIS_SUPERSTEP_DONE and is
>> now on superstep 1
>> 2014-01-27 21:59:17,811 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 567946 more msecs left before giving up.
>> 2014-01-27 21:59:47,817 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 537940 more msecs left before giving up.
>> 2014-01-27 22:00:17,822 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 507935 more msecs left before giving up.
>> 2014-01-27 22:00:47,828 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 477929 more msecs left before giving up.
>> 2014-01-27 22:01:17,834 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 447923 more msecs left before giving up.
>> 2014-01-27 22:01:47,840 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 417917 more msecs left before giving up.
>> 2014-01-27 22:02:17,848 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 387909 more msecs left before giving up.
>> 2014-01-27 22:02:47,853 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 357904 more msecs left before giving up.
>> 2014-01-27 22:03:17,858 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 327899 more msecs left before giving up.
>> 2014-01-27 22:03:47,863 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 297894 more msecs left before giving up.
>> 2014-01-27 22:04:17,870 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 267887 more msecs left before giving up.
>> 2014-01-27 22:04:47,876 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 237881 more msecs left before giving up.
>> 2014-01-27 22:05:17,881 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 207876 more msecs left before giving up.
>> 2014-01-27 22:05:47,887 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 177870 more msecs left before giving up.
>> 2014-01-27 22:06:17,892 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 147865 more msecs left before giving up.
>> 2014-01-27 22:06:47,897 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 117860 more msecs left before giving up.
>> 2014-01-27 22:07:17,930 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 87827 more msecs left before giving up.
>> 2014-01-27 22:07:47,936 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 57821 more msecs left before giving up.
>> 2014-01-27 22:08:17,941 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> 27816 more msecs left before giving up.
>> 2014-01-27 22:08:47,946 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
>> responses of 15 needed to start superstep 1.  Reporting every 30000 msecs,
>> -2189 more msecs left before giving up.
>> 2014-01-27 22:08:47,946 ERROR org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): checkWorkers: Did not receive
>> enough processes in time (only 0 of 15 required) after waiting
>> 600000msecs).  This occurs if you do not have enough map tasks available
>> simultaneously on your Hadoop instance to fulfill the number of requested
>> workers.
>> 2014-01-27 22:08:47,948 INFO org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): setJobState:
>> {"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on
>> superstep 1
>> 2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): failJob: Killing job
>> job_201401272153_0001
>> 2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster
>> (org.apache.giraph.master.MasterThread): failJob: exception
>> java.lang.IllegalStateException: coordinateSuperstep: Not enough healthy
>> workers for superstep 1
>> 2014-01-27 22:08:48,022 ERROR org.apache.giraph.master.MasterThread
>> (org.apache.giraph.master.MasterThread): masterThread: Master algorithm
>> failed with NullPointerException
>> java.lang.NullPointerException
>>     at
>> org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354)
>>     at
>> org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72)
>>     at
>> org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563)
>>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:115)
>> 2014-01-27 22:08:48,023 FATAL org.apache.giraph.graph.GraphMapper
>> (org.apache.giraph.master.MasterThread): uncaughtException:
>> OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread,
>> msg = java.lang.NullPointerException, exiting...
>> java.lang.IllegalStateException: java.lang.NullPointerException
>>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:185)
>> Caused by: java.lang.NullPointerException
>>     at
>> org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354)
>>     at
>> org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72)
>>     at
>> org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563)
>>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:115)
>> 2014-01-27 22:08:48,025 INFO org.apache.giraph.zk.ZooKeeperManager
>> (Thread-16): run: Shutdown hook started.
>> 2014-01-27 22:08:48,025 WARN org.apache.giraph.zk.ZooKeeperManager
>> (Thread-16): onlineZooKeeperServers: Forced a shutdown hook kill of the
>> ZooKeeper process.
>>
>> All of the worker nodes succesfully complete step 0.
>>
>> I tried adding more mappers (machines) to the cluster and that doesn't
>> seem to prevent the error. I've been using a forked version of the latest
>> trunk code on the apache giraph github repo. I'm not sure why the workers
>> would be able to communicate properly for step 0 and then fail to do so
>> when starting step 1.
>>
>> My current cluster configuration is 1 x m2.4xlarge for the master node
>> and then 15 x m1.large for the core nodes. I'm also using
>> giraph.useOutOfCoreMessages=true.
>>
>> For reference here is a log from one of the workers:
>>
>> 2014-01-27 21:57:29,532 INFO
>> org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main):
>> newStore: Created class
>> org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex
>> id class org.apache.hadoop.io.Text and message value class
>> org.apache.hadoop.io.Text and no combiner
>> 2014-01-27 21:57:29,532 INFO
>> org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main):
>> newStore: Created class
>> org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex
>> id class org.apache.hadoop.io.Text and message value class
>> org.apache.hadoop.io.Text and no combiner
>> 2014-01-27 21:57:29,550 WARN org.apache.giraph.bsp.BspService
>> (main-EventThread): process: Unknown and unprocessed event
>> (path=/_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir,
>> type=NodeChildrenChanged, state=SyncConnected)
>> 2014-01-27 21:57:29,654 INFO org.apache.giraph.worker.BspServiceWorker
>> (main): registerHealth: Created my health node for attempt=0, superstep=0
>> with
>> /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1
>> and workerInfo= Worker(hostname=ip-10-9-140-146.ec2.internal, MRtaskID=1,
>> port=30001)
>> 2014-01-27 21:57:29,696 INFO org.apache.giraph.bsp.BspService
>> (main-EventThread): process: partitionAssignmentsReadyChanged (partitions
>> are assigned)
>> 2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker
>> (main): startSuperstep: Master(hostname=ip-10-238-165-32.ec2.internal,
>> MRtaskID=0, port=30000)
>> 2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker
>> (main): startSuperstep: Ready for computation on superstep 0 since worker
>> selection and vertex range assignments are done in
>> /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_addressesAndPartitions
>> 2014-01-27 21:57:29,715 INFO org.apache.giraph.comm.netty.NettyClient
>> (main): connectAllAddresses: Successfully added 0 connections, (0 total
>> connected) 0 failed, 0 failures total.
>> 2014-01-27 21:57:29,719 INFO org.apache.giraph.comm.netty.NettyClient
>> (main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0001,
>> MBytes/sec received = 0, MBytesSent = 0, MBytesReceived = 0, ave sent req
>> MBytes = 0, ave received req MBytes = 0, secs waited = 0.31
>> 2014-01-27 21:57:29,726 INFO org.apache.giraph.worker.BspServiceWorker
>> (main): sendWorkerPartitions: Done sending all my partitions.
>> 2014-01-27 21:57:29,738 INFO org.apache.giraph.worker.BspServiceWorker
>> (main): exchangeVertexPartitions: Done with exchange.
>> 2014-01-27 21:57:29,745 INFO
>> org.apache.giraph.utils.TaskIdsPermitsBarrier (main):
>> waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator
>> data, task ids: [4]
>> 2014-01-27 21:57:29,747 INFO org.apache.giraph.graph.GraphTaskManager
>> (main): execute: 15 partitions to process with 1 compute thread(s),
>> originally 1 thread(s) on superstep 0
>> 2014-01-27 21:57:31,649 INFO
>> org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-3):
>> decode: Server window metrics MBytes/sec sent = 0.0014, MBytes/sec received
>> = 1.3229, MBytesSent = 0.042, MBytesReceived = 39.6883, ave sent req MBytes
>> = 0, ave received req MBytes = 0.0139, secs waited = 30.0
>> 2014-01-27 21:58:04,507 INFO
>> org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-7):
>> decode: Server window metrics MBytes/sec sent = 0.0348, MBytes/sec received
>> = 7.4575, MBytesSent = 1.1421, MBytesReceived = 245.0474, ave sent req
>> MBytes = 0, ave received req MBytes = 0.0137, secs waited = 32.858
>> 2014-01-27 21:58:05,725 INFO org.apache.giraph.graph.ComputeCallable
>> (compute-0): call: Completed 6 partitions, 9 remaining Memory
>> (free/total/max) = 766.17M / 1939.00M / 3641.00M
>> 2014-01-27 21:58:32,284 INFO org.apache.giraph.utils.ProgressableUtils
>> (main): waitFor: Future result not ready yet
>> java.util.concurrent.FutureTask@7c2bc94a
>> 2014-01-27 21:58:32,292 INFO org.apache.giraph.utils.ProgressableUtils
>> (main): waitFor: Waiting for
>> org.apache.giraph.utils.ProgressableUtils$FutureWaitable@3ae3f711
>> 2014-01-27 21:58:34,508 INFO
>> org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-1):
>> decode: Server window metrics MBytes/sec sent = 0.0325, MBytes/sec received
>> = 7.2723, MBytesSent = 0.9743, MBytesReceived = 218.1846, ave sent req
>> MBytes = 0, ave received req MBytes = 0.0191, secs waited = 30.001
>> 2014-01-27 21:58:36,622 INFO org.apache.giraph.graph.ComputeCallable
>> (compute-0): call: Completed 14 partitions, 1 remaining Memory
>> (free/total/max) = 872.90M / 2989.00M / 3641.00M
>> 2014-01-27 21:58:38,073 INFO org.apache.giraph.graph.ComputeCallable
>> (compute-0): call: Computation took 68.32348 secs for 15 partitions on
>> superstep 0.  Flushing started
>> 2014-01-27 21:58:38,095 INFO org.apache.giraph.worker.BspServiceWorker
>> (main): finishSuperstep: Waiting on all requests, superstep 0 Memory
>> (free/total/max) = 716.70M / 2989.00M / 3641.00M
>> 2014-01-27 21:58:38,095 INFO org.apache.giraph.comm.netty.NettyClient
>> (main): logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 5568
>> open requests, waiting for it to be <= 0, MBytes/sec sent = 7.6591,
>> MBytes/sec received = 0.0349, MBytesSent = 466.4169, MBytesReceived =
>> 2.1229, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs
>> waited = 60.896
>> 2014-01-27 21:58:38,101 INFO org.apache.giraph.comm.netty.NettyClient
>> (main): logInfoAboutOpenRequests: 2808 requests for taskId=8, 2641 requests
>> for taskId=11, 104 requests for taskId=3, 1 requests for taskId=5, 1
>> requests for taskId=14,
>> 2014-01-27 21:58:43,728 INFO org.apache.giraph.comm.netty.NettyClient
>> (main): waitAllRequests: Finished all requests. MBytes/sec sent = 7.0106,
>> MBytes/sec received = 0.0329, MBytesSent = 466.4169, MBytesReceived =
>> 2.1919, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs
>> waited = 66.529
>> 2014-01-27 21:58:43,728 INFO
>> org.apache.giraph.worker.WorkerAggregatorHandler (main): finishSuperstep:
>> Start gathering aggregators, workers will send their aggregated values once
>> they are done with superstep computation
>> 2014-01-27 21:58:43,768 INFO
>> org.apache.giraph.utils.TaskIdsPermitsBarrier (main):
>> waitForRequiredPermits: Waiting for 3 more tasks to send their aggregator
>> data, task ids: [11, 13, 14]
>> 2014-01-27 21:58:45,673 INFO org.apache.giraph.comm.netty.NettyClient
>> (main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0048,
>> MBytes/sec received = 0.0025, MBytesSent = 0, MBytesReceived = 0, ave sent
>> req MBytes = 0, ave received req MBytes = 0, secs waited = 0.005
>> 2014-01-27 21:58:45,673 INFO org.apache.giraph.worker.BspServiceWorker
>> (main): finishSuperstep: Superstep 0, messages = 0 , message bytes = 0 ,
>> Memory (free/total/max) = 1093.58M / 2989.00M / 3641.00M
>> 2014-01-27 21:58:45,682 INFO org.apache.giraph.worker.BspServiceWorker
>> (main): finishSuperstep: (waiting for rest of workers) WORKER_ONLY -
>> Attempt=0, Superstep=0
>> 2014-01-27 21:58:45,721 INFO org.apache.giraph.bsp.BspService
>> (main-EventThread): process: superstepFinished signaled
>> 2014-01-27 21:58:45,744 INFO org.apache.giraph.worker.BspServiceWorker
>> (main): finishSuperstep: Completed superstep 0 with global stats
>> (vtx=364766,finVtx=0,edges=3815107,msgCount=0,msgBytesCount=0,haltComputation=false)
>> and classes
>> (computation=org.apache.giraph.examples.ReflexiveConnectedStringComponentsComputation,combiner=null)
>> 2014-01-27 21:58:45,744 INFO
>> org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main):
>> newStore: Created class
>> org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex
>> id class org.apache.hadoop.io.Text and message value class
>> org.apache.hadoop.io.Text and no combiner
>> 2014-01-27 21:58:52,752 ERROR org.apache.giraph.worker.BspServiceWorker
>> (main): unregisterHealth: Got failure, unregistering health on
>> /_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1
>> on superstep 1
>> 2014-01-27 21:58:52,758 INFO org.apache.hadoop.mapred.Task (main):
>> Task:attempt_201401272153_0001_m_000001_0 is done. And is in the process of
>> commiting
>> 2014-01-27 21:58:54,846 INFO org.apache.hadoop.mapred.Task (main): Task
>> 'attempt_201401272153_0001_m_000001_0' done.
>>
>> Thank you for your time and help.
>>
>> --
>> Sam
>>
>
>
>
> --
> Sam Garrett
> ActionX, NYC, 260 820 0394
>



-- 
Sam Garrett
ActionX

Re: Superstep 0 finishes, but Superstep 1 doesn't start

Reply via email to