Hi Zhiwei,

This is a known issue. The root cause is shown here:

2011-10-08 09:27:05,236 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:597)
        at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
        at java.security.AccessController.doPrivileged(Native Method)

It has been addressed in GIRAPH-12 (https://issues.apache.org/jira/browse/GIRAPH-12).

<snip>
Currently every worker starts a thread to communicate with every other worker. Hadoop RPC is used for communication. For instance, if there are 400 workers, each worker will create 400 threads. This ends up using a lot of stack memory per worker, even with the option

-Dmapred.child.java.opts="-Xss64k".
</snip>

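To put rough numbers on the description above, here is a small back-of-the-envelope sketch. It only multiplies thread count by per-thread stack size. The 64 KB figure comes from the -Xss64k option quoted above; the 1024 KB figure is an assumed default stack size, which varies by JVM and platform. The class and method names are made up for illustration.

```java
// Rough estimate of the stack memory reserved when a worker opens one
// thread per peer. All numbers are illustrative assumptions.
class ThreadStackEstimate {
    /** Bytes of stack reserved for `threads` threads of `stackBytes` each. */
    static long reservedStackBytes(int threads, long stackBytes) {
        return (long) threads * stackBytes;
    }

    public static void main(String[] args) {
        long kb = 1024L;
        int[] workerCounts = {200, 400, 500};
        // 64 KB matches -Xss64k from the quoted text; 1024 KB is an assumed
        // default stack size, not a value taken from this thread.
        long[] stackSizes = {64 * kb, 1024 * kb};
        for (int workers : workerCounts) {
            for (long stack : stackSizes) {
                long mb = reservedStackBytes(workers, stack) / (kb * kb);
                System.out.printf("%d threads x %d KB stack = %d MB%n",
                        workers, stack / kb, mb);
            }
        }
    }
}
```

Note that this covers stack reservation only. "unable to create new native thread" can also be triggered by per-user thread or process limits on the host, so the estimate is a lower bound on the pressure, not a full diagnosis.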

It would be good if you could try the latest Apache Giraph instead of the older version at Yahoo!. You would then need to set GiraphJob.MSG_NUM_FLUSH_THREADS (giraph.msgNumFlushThreads) to a value that won't cause you to run out of stack space.
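
The general idea behind capping the number of flush threads can be sketched with a plain fixed-size thread pool: the thread count stays bounded no matter how many peers there are. This is not Giraph's implementation, only an illustration using java.util.concurrent; the class name, method name, and the placeholder "flush" task are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BoundedFlushSketch {
    /**
     * Runs one flush task per peer on at most `flushThreads` threads and
     * returns the names of the threads that actually did the work.
     */
    static Set<String> flushAll(int peers, int flushThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(flushThreads);
        Set<String> threadsUsed = ConcurrentHashMap.newKeySet();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int peer = 0; peer < peers; peer++) {
                pending.add(pool.submit(() -> {
                    // Placeholder for sending buffered messages to one peer.
                    threadsUsed.add(Thread.currentThread().getName());
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every flush task to complete
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while flushing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("flush task failed", e);
        } finally {
            pool.shutdown();
        }
        return threadsUsed;
    }

    public static void main(String[] args) {
        Set<String> used = flushAll(500, 16);
        System.out.println("peers=500, distinct threads used=" + used.size());
    }
}
```

With 500 peers and a pool of 16, the work is spread over at most 16 threads instead of 500, which is the trade-off the setting is meant to control: fewer threads and less stack, at the cost of less parallelism when flushing.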

Avery

On 10/10/11 11:08 AM, Zhiwei Gu wrote:
Hi all,
In my Giraph job, when I set the number of workers to 200, it runs fine, but when I set it to 500, it fails due to an early-stage OOM exception in one (or more) workers. Once that worker fails, the other workers that want to talk to it keep waiting until they have retried 5 times, and then fail as well.

Have you ever faced such an issue?

Best,
-z


Here is the exception:
2011-10-08 09:26:59,108 INFO org.apache.giraph.comm.RPCCommunications: getRPCServer: Added jobToken Ident: 17 6a 6f 62 5f 32 30 31 31 30 38 32 36 30 39 31 31 5f 36 36 37 30 39 30, Pass: 12 26 1a f1 d2 51 e1 bf 2d 36 63 11 26 18 17 3d 53 b3 15 f6, Kind: mapreduce.job, Service: job_201108260911_667090
2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,116 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,117 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
2011-10-08 09:26:59,120 INFO 
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source 
RpcDetailedActivityForPort31250 registered.
2011-10-08 09:26:59,121 INFO 
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source 
RpcActivityForPort31250 registered.
2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server 
Responder: starting
2011-10-08 09:26:59,123 INFO org.apache.hadoop.ipc.Server: IPC Server listener 
on 31250: starting
2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 31250: starting
2011-10-08 09:26:59,127 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 31250: starting
2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 
on 31250: starting
2011-10-08 09:26:59,133 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 
on 31250: starting
2011-10-08 09:26:59,137 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 
on 31250: starting
2011-10-08 09:26:59,144 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 
on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
10 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
11 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
12 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
13 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
14 on 31250: starting
2011-10-08 09:26:59,145 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
15 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
16 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
17 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
18 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
19 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
20 on 31250: starting
2011-10-08 09:26:59,146 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
21 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
22 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
23 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
24 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
25 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
26 on 31250: starting
2011-10-08 09:26:59,147 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
27 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
28 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
29 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
30 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
31 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
32 on 31250: starting
2011-10-08 09:26:59,148 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
33 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
34 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
35 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
36 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
37 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
38 on 31250: starting
2011-10-08 09:26:59,149 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
39 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
40 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
41 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
42 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
43 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
44 on 31250: starting
2011-10-08 09:26:59,150 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
45 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
46 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
47 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
48 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
49 on 31250: starting
2011-10-08 09:26:59,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
50 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
51 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
52 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
53 on 31250: starting
2011-10-08 09:26:59,152 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
54 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
55 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
56 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
57 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
58 on 31250: starting
2011-10-08 09:26:59,153 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
59 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
60 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
61 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
62 on 31250: starting
2011-10-08 09:26:59,154 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
63 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
64 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
65 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
66 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
67 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
68 on 31250: starting
2011-10-08 09:26:59,155 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
69 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
70 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
71 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
72 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
73 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
74 on 31250: starting
2011-10-08 09:26:59,156 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
75 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
76 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
77 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
78 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
79 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
80 on 31250: starting
2011-10-08 09:26:59,157 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
81 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
82 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
83 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
84 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
85 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
86 on 31250: starting
2011-10-08 09:26:59,158 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
87 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
88 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
89 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
90 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
91 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
92 on 31250: starting
2011-10-08 09:26:59,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
93 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
94 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
95 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
96 on 31250: starting
2011-10-08 09:26:59,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
97 on 31250: starting
2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
98 on 31250: starting
2011-10-08 09:26:59,161 INFO org.apache.giraph.comm.BasicRPCCommunications: 
BasicRPCCommunications: Started RPC communication 
server:gsta33033.tan.ygrid.yahoo.com/10.216.176.59:31250 with 100 handlers
2011-10-08 09:26:59,161 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
99 on 31250: starting
2011-10-08 09:27:05,234 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
Initializing logs' truncater with mapRetainSize=102400 and 
reduceRetainSize=102400
2011-10-08 09:27:05,236 FATAL org.apache.hadoop.mapred.Child: Error running 
child : java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:597)
        at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
        at org.apache.hadoop.util.Shell.run(Shell.java:182)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
        at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
        at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
        at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:417)
        at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:400)
        at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:275)
        at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
        at org.apache.hadoop.mapred.Child.main(Child.java:255)

2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Stopping MapTask metrics system...
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Stopping metrics source ugi(org.apache.hadoop.security.UgiInstrumentation)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Stopping metrics source jvm(org.apache.hadoop.metrics2.source.JvmMetricsSource)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Stopping metrics source 
RpcDetailedActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation$Detailed)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Stopping metrics source 
RpcActivityForPort31250(org.apache.hadoop.ipc.metrics.RpcInstrumentation)
2011-10-08 09:27:05,272 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
MapTask metrics system stopped.

--
Best Regards
Zhiwei Gu

