Hi all,
In my Giraph job, everything works when I set the number of workers to 200,
but with 500 workers the job fails because of an early-stage OOM exception in
one (or more) workers. Once that worker fails, the other workers that want to
talk to it keep waiting until they have retried 5 times, and then each
waiting worker fails as well.
Have you ever faced such an issue?
Best,
-z
Here is the exception:
2011-10-08 09:26:59,108 INFO org.apache.giraph.comm.RPCCommunications:
getRPCServer: Added jobToken Ident: 17 6a 6f 62 5f 32 30 31 31 30 38
32 36 30 39 31 31 5f 36 36 37 30 39 30, Pass: 12 26 1a f1 d2 51 e1 bf
2d 36 63 11 26 18 17 3d 53 b3 15 f6, Kind: mapreduce.job, Service:
job_201108260911_667090
2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,120 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
RpcDetailedActivityForPort31250 registered.
2011-10-08 09:26:59,121 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
RpcActivityForPort31250 registered.
2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server listener
on 31250: starting
2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0
on 31250: starting
2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1
on 31250: starting
2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2
on 31250: starting
2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3
on 31250: starting
2011-10-08 09:26:59,137 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9
on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler
10 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler
11 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler
12 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler
13 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler
14 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler
15 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler
16 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler
17 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler
18 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler
19 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler
20 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler
21 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler
22 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler
23 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler
24 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler
25 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler
26 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler
27 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler
28 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler
29 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler
30 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler
31 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler
32 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler
33 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler
34 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler
35 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler
36 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler
37 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler
38 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler
39 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler
40 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler
41 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler
42 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler
43 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler
44 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler
45 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler
46 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler
47 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler
48 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler
49 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler
50 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler
51 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler
52 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler
53 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler
54 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler
55 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler
56 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler
57 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler
58 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler
59 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler
60 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler
61 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler
62 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler
63 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler
64 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler
65 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler
66 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler
67 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler
68 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler
69 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler
70 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler
71 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler
72 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler
73 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler
74 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler
75 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler
76 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler
77 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler
78 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler
79 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler
80 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler
81 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler
82 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler
83 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler
84 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler
85 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler
86 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler
87 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler
88 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler
89 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler
90 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler
91 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler
92 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler
93 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler
94 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler
95 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler
96 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler
97 on 31250: starting
2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler
98 on 31250: starting
2011-10-08 09:26:59,161 INFO org.apache.giraph.comm.BasicRPCCommunications:
BasicRPCCommunications: Started RPC communication
server:gsta33033.tan.ygrid.yahoo.com/10.216.176.59:31250 with 100 handlers
2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler
99 on 31250: starting
2011-10-08 09:27:05,234 INFO org.apache.hadoop.mapred.TaskLogsTruncater:
Initializing logs' truncater with mapRetainSize=102400 and
reduceRetainSize=102400
2011-10-08 09:27:05,236 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method)
    at java.lang.Thread.start(Thread.java:597)
    at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
    at org.apache.hadoop.util.Shell.run(Shell.java:182)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
    at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
    at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
    at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
    at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
    at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
    at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:255)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Stopping MapTask metrics system...
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Stopping metrics source ugi(org.apache.hadoop.security.UgiInstrumentation)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Stopping metrics source jvm(org.apache.hadoop.metrics2.source.JvmMetricsSource)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Stopping metrics source
RpcDetailedActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation$Detailed)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
Stopping metrics source
RpcActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl:
MapTask metrics system stopped.
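
A note on the error itself: "unable to create new native thread" is thrown when
the JVM cannot start another thread, which usually points at an OS thread or
process limit or at native memory rather than at the Java heap. The log above
shows this worker starting 100 IPC handler threads on port 31250. To see how
many threads a worker JVM really holds, something like the following might
help. This is only a minimal sketch: the class name ThreadCountProbe is
hypothetical and not part of Giraph or Hadoop, and it only uses the standard
java.lang.management API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical diagnostic helper; not part of Giraph or Hadoop.
public class ThreadCountProbe {

    /** Number of live threads (daemon and non-daemon) in this JVM. */
    public static int liveThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    /** Peak live thread count since the JVM started (or since the peak was reset). */
    public static int peakThreads() {
        return ManagementFactory.getThreadMXBean().getPeakThreadCount();
    }

    public static void main(String[] args) {
        // Print both values so they can be compared against the OS limits.
        System.out.println("live threads: " + liveThreads()
                + ", peak: " + peakThreads());
    }
}
```

Calling liveThreads() from the worker right after the RPC server starts, with
200 and then with 500 workers, should show whether the thread count per JVM
grows with the number of workers.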
--
Best Regards
Zhiwei Gu