Hello.
I've been working on a modified version of the clustered components example
code. I've been running Giraph in Amazon EMR successfully with other
versions of this code, however with this iteration I keep getting this
error on the master:
2014-01-27 21:58:45,756 INFO org.apache.giraph.master.MasterThread
(org.apache.giraph.master.MasterThread): masterThread: Coordination of
superstep 0 took 76.184 seconds ended with state THIS_SUPERSTEP_DONE and is
now on superstep 1
2014-01-27 21:59:17,811 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
567946 more msecs left before giving up.
2014-01-27 21:59:47,817 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
537940 more msecs left before giving up.
2014-01-27 22:00:17,822 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
507935 more msecs left before giving up.
2014-01-27 22:00:47,828 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
477929 more msecs left before giving up.
2014-01-27 22:01:17,834 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
447923 more msecs left before giving up.
2014-01-27 22:01:47,840 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
417917 more msecs left before giving up.
2014-01-27 22:02:17,848 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
387909 more msecs left before giving up.
2014-01-27 22:02:47,853 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
357904 more msecs left before giving up.
2014-01-27 22:03:17,858 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
327899 more msecs left before giving up.
2014-01-27 22:03:47,863 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
297894 more msecs left before giving up.
2014-01-27 22:04:17,870 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
267887 more msecs left before giving up.
2014-01-27 22:04:47,876 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
237881 more msecs left before giving up.
2014-01-27 22:05:17,881 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
207876 more msecs left before giving up.
2014-01-27 22:05:47,887 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
177870 more msecs left before giving up.
2014-01-27 22:06:17,892 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
147865 more msecs left before giving up.
2014-01-27 22:06:47,897 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
117860 more msecs left before giving up.
2014-01-27 22:07:17,930 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
87827 more msecs left before giving up.
2014-01-27 22:07:47,936 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
57821 more msecs left before giving up.
2014-01-27 22:08:17,941 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
27816 more msecs left before giving up.
2014-01-27 22:08:47,946 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Only found 0
responses of 15 needed to start superstep 1. Reporting every 30000 msecs,
-2189 more msecs left before giving up.
2014-01-27 22:08:47,946 ERROR org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): checkWorkers: Did not receive
enough processes in time (only 0 of 15 required) after waiting
600000msecs). This occurs if you do not have enough map tasks available
simultaneously on your Hadoop instance to fulfill the number of requested
workers.
2014-01-27 22:08:47,948 INFO org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): setJobState:
{"_stateKey":"FAILED","_applicationAttemptKey":-1,"_superstepKey":-1} on
superstep 1
2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): failJob: Killing job
job_201401272153_0001
2014-01-27 22:08:47,954 FATAL org.apache.giraph.master.BspServiceMaster
(org.apache.giraph.master.MasterThread): failJob: exception
java.lang.IllegalStateException: coordinateSuperstep: Not enough healthy
workers for superstep 1
2014-01-27 22:08:48,022 ERROR org.apache.giraph.master.MasterThread
(org.apache.giraph.master.MasterThread): masterThread: Master algorithm
failed with NullPointerException
java.lang.NullPointerException
at
org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354)
at
org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72)
at
org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:115)
2014-01-27 22:08:48,023 FATAL org.apache.giraph.graph.GraphMapper
(org.apache.giraph.master.MasterThread): uncaughtException:
OverrideExceptionHandler on thread org.apache.giraph.master.MasterThread,
msg = java.lang.NullPointerException, exiting...
java.lang.IllegalStateException: java.lang.NullPointerException
at org.apache.giraph.master.MasterThread.run(MasterThread.java:185)
Caused by: java.lang.NullPointerException
at
org.apache.giraph.comm.netty.NettyClient.connectAllAddresses(NettyClient.java:354)
at
org.apache.giraph.comm.netty.NettyMasterClient.openConnections(NettyMasterClient.java:72)
at
org.apache.giraph.master.BspServiceMaster.coordinateSuperstep(BspServiceMaster.java:1563)
at org.apache.giraph.master.MasterThread.run(MasterThread.java:115)
2014-01-27 22:08:48,025 INFO org.apache.giraph.zk.ZooKeeperManager
(Thread-16): run: Shutdown hook started.
2014-01-27 22:08:48,025 WARN org.apache.giraph.zk.ZooKeeperManager
(Thread-16): onlineZooKeeperServers: Forced a shutdown hook kill of the
ZooKeeper process.
All of the worker nodes succesfully complete step 0.
I tried adding more mappers (machines) to the cluster and that doesn't seem
to prevent the error. I've been using a forked version of the latest trunk
code on the apache giraph github repo. I'm not sure why the workers would
be able to communicate properly for step 0 and then fail to do so when
starting step 1.
My current cluster configuration is 1 x m2.4xlarge for the master node and
then 15 x m1.large for the core nodes. I'm also using
giraph.useOutOfCoreMessages=true.
For reference here is a log from one of the workers:
2014-01-27 21:57:29,532 INFO
org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main):
newStore: Created class
org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex
id class org.apache.hadoop.io.Text and message value class
org.apache.hadoop.io.Text and no combiner
2014-01-27 21:57:29,532 INFO
org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main):
newStore: Created class
org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex
id class org.apache.hadoop.io.Text and message value class
org.apache.hadoop.io.Text and no combiner
2014-01-27 21:57:29,550 WARN org.apache.giraph.bsp.BspService
(main-EventThread): process: Unknown and unprocessed event
(path=/_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir,
type=NodeChildrenChanged, state=SyncConnected)
2014-01-27 21:57:29,654 INFO org.apache.giraph.worker.BspServiceWorker
(main): registerHealth: Created my health node for attempt=0, superstep=0
with
/_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1
and workerInfo= Worker(hostname=ip-10-9-140-146.ec2.internal, MRtaskID=1,
port=30001)
2014-01-27 21:57:29,696 INFO org.apache.giraph.bsp.BspService
(main-EventThread): process: partitionAssignmentsReadyChanged (partitions
are assigned)
2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker
(main): startSuperstep: Master(hostname=ip-10-238-165-32.ec2.internal,
MRtaskID=0, port=30000)
2014-01-27 21:57:29,715 INFO org.apache.giraph.worker.BspServiceWorker
(main): startSuperstep: Ready for computation on superstep 0 since worker
selection and vertex range assignments are done in
/_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_addressesAndPartitions
2014-01-27 21:57:29,715 INFO org.apache.giraph.comm.netty.NettyClient
(main): connectAllAddresses: Successfully added 0 connections, (0 total
connected) 0 failed, 0 failures total.
2014-01-27 21:57:29,719 INFO org.apache.giraph.comm.netty.NettyClient
(main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0001,
MBytes/sec received = 0, MBytesSent = 0, MBytesReceived = 0, ave sent req
MBytes = 0, ave received req MBytes = 0, secs waited = 0.31
2014-01-27 21:57:29,726 INFO org.apache.giraph.worker.BspServiceWorker
(main): sendWorkerPartitions: Done sending all my partitions.
2014-01-27 21:57:29,738 INFO org.apache.giraph.worker.BspServiceWorker
(main): exchangeVertexPartitions: Done with exchange.
2014-01-27 21:57:29,745 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier
(main): waitForRequiredPermits: Waiting for 1 more tasks to send their
aggregator data, task ids: [4]
2014-01-27 21:57:29,747 INFO org.apache.giraph.graph.GraphTaskManager
(main): execute: 15 partitions to process with 1 compute thread(s),
originally 1 thread(s) on superstep 0
2014-01-27 21:57:31,649 INFO
org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-3):
decode: Server window metrics MBytes/sec sent = 0.0014, MBytes/sec received
= 1.3229, MBytesSent = 0.042, MBytesReceived = 39.6883, ave sent req MBytes
= 0, ave received req MBytes = 0.0139, secs waited = 30.0
2014-01-27 21:58:04,507 INFO
org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-7):
decode: Server window metrics MBytes/sec sent = 0.0348, MBytes/sec received
= 7.4575, MBytesSent = 1.1421, MBytesReceived = 245.0474, ave sent req
MBytes = 0, ave received req MBytes = 0.0137, secs waited = 32.858
2014-01-27 21:58:05,725 INFO org.apache.giraph.graph.ComputeCallable
(compute-0): call: Completed 6 partitions, 9 remaining Memory
(free/total/max) = 766.17M / 1939.00M / 3641.00M
2014-01-27 21:58:32,284 INFO org.apache.giraph.utils.ProgressableUtils
(main): waitFor: Future result not ready yet
java.util.concurrent.FutureTask@7c2bc94a
2014-01-27 21:58:32,292 INFO org.apache.giraph.utils.ProgressableUtils
(main): waitFor: Waiting for
org.apache.giraph.utils.ProgressableUtils$FutureWaitable@3ae3f711
2014-01-27 21:58:34,508 INFO
org.apache.giraph.comm.netty.handler.RequestDecoder (netty-server-exec-1):
decode: Server window metrics MBytes/sec sent = 0.0325, MBytes/sec received
= 7.2723, MBytesSent = 0.9743, MBytesReceived = 218.1846, ave sent req
MBytes = 0, ave received req MBytes = 0.0191, secs waited = 30.001
2014-01-27 21:58:36,622 INFO org.apache.giraph.graph.ComputeCallable
(compute-0): call: Completed 14 partitions, 1 remaining Memory
(free/total/max) = 872.90M / 2989.00M / 3641.00M
2014-01-27 21:58:38,073 INFO org.apache.giraph.graph.ComputeCallable
(compute-0): call: Computation took 68.32348 secs for 15 partitions on
superstep 0. Flushing started
2014-01-27 21:58:38,095 INFO org.apache.giraph.worker.BspServiceWorker
(main): finishSuperstep: Waiting on all requests, superstep 0 Memory
(free/total/max) = 716.70M / 2989.00M / 3641.00M
2014-01-27 21:58:38,095 INFO org.apache.giraph.comm.netty.NettyClient
(main): logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 5568
open requests, waiting for it to be <= 0, MBytes/sec sent = 7.6591,
MBytes/sec received = 0.0349, MBytesSent = 466.4169, MBytesReceived =
2.1229, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs
waited = 60.896
2014-01-27 21:58:38,101 INFO org.apache.giraph.comm.netty.NettyClient
(main): logInfoAboutOpenRequests: 2808 requests for taskId=8, 2641 requests
for taskId=11, 104 requests for taskId=3, 1 requests for taskId=5, 1
requests for taskId=14,
2014-01-27 21:58:43,728 INFO org.apache.giraph.comm.netty.NettyClient
(main): waitAllRequests: Finished all requests. MBytes/sec sent = 7.0106,
MBytes/sec received = 0.0329, MBytesSent = 466.4169, MBytesReceived =
2.1919, ave sent req MBytes = 0.0026, ave received req MBytes = 0, secs
waited = 66.529
2014-01-27 21:58:43,728 INFO
org.apache.giraph.worker.WorkerAggregatorHandler (main): finishSuperstep:
Start gathering aggregators, workers will send their aggregated values once
they are done with superstep computation
2014-01-27 21:58:43,768 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier
(main): waitForRequiredPermits: Waiting for 3 more tasks to send their
aggregator data, task ids: [11, 13, 14]
2014-01-27 21:58:45,673 INFO org.apache.giraph.comm.netty.NettyClient
(main): waitAllRequests: Finished all requests. MBytes/sec sent = 0.0048,
MBytes/sec received = 0.0025, MBytesSent = 0, MBytesReceived = 0, ave sent
req MBytes = 0, ave received req MBytes = 0, secs waited = 0.005
2014-01-27 21:58:45,673 INFO org.apache.giraph.worker.BspServiceWorker
(main): finishSuperstep: Superstep 0, messages = 0 , message bytes = 0 ,
Memory (free/total/max) = 1093.58M / 2989.00M / 3641.00M
2014-01-27 21:58:45,682 INFO org.apache.giraph.worker.BspServiceWorker
(main): finishSuperstep: (waiting for rest of workers) WORKER_ONLY -
Attempt=0, Superstep=0
2014-01-27 21:58:45,721 INFO org.apache.giraph.bsp.BspService
(main-EventThread): process: superstepFinished signaled
2014-01-27 21:58:45,744 INFO org.apache.giraph.worker.BspServiceWorker
(main): finishSuperstep: Completed superstep 0 with global stats
(vtx=364766,finVtx=0,edges=3815107,msgCount=0,msgBytesCount=0,haltComputation=false)
and classes
(computation=org.apache.giraph.examples.ReflexiveConnectedStringComponentsComputation,combiner=null)
2014-01-27 21:58:45,744 INFO
org.apache.giraph.comm.messages.InMemoryMessageStoreFactory (main):
newStore: Created class
org.apache.giraph.comm.messages.ByteArrayMessagesPerVertexStore for vertex
id class org.apache.hadoop.io.Text and message value class
org.apache.hadoop.io.Text and no combiner
2014-01-27 21:58:52,752 ERROR org.apache.giraph.worker.BspServiceWorker
(main): unregisterHealth: Got failure, unregistering health on
/_hadoopBsp/job_201401272153_0001/_applicationAttemptsDir/0/_superstepDir/0/_workerHealthyDir/ip-10-9-140-146.ec2.internal_1
on superstep 1
2014-01-27 21:58:52,758 INFO org.apache.hadoop.mapred.Task (main):
Task:attempt_201401272153_0001_m_000001_0 is done. And is in the process of
commiting
2014-01-27 21:58:54,846 INFO org.apache.hadoop.mapred.Task (main): Task
'attempt_201401272153_0001_m_000001_0' done.
Thank you for your time and help.