[
https://issues.apache.org/jira/browse/MAPREDUCE-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848406#comment-13848406
]
viswanathan commented on MAPREDUCE-5351:
----------------------------------------
Hi Chris,
JobTracker memory usage reaches 6.68/8.89 GB, after which we are unable to submit jobs and the web UI does not load at all. However, we did not see any OOM exceptions in the JobTracker logs.
I have taken a thread dump of the JobTracker; it is as follows:
Deadlock Detection:
Can't print deadlocks:null
Thread 25817: (state = BLOCKED)
- java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
- org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=274, line=397 (Compiled frame)
- org.apache.hadoop.hdfs.LeaseRenewer.access$600(org.apache.hadoop.hdfs.LeaseRenewer, int) @bci=2, line=69 (Interpreted frame)
- org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=8, line=273 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=662 (Interpreted frame)
Locked ownable synchronizers:
- None
Thread 25815: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run() @bci=245, line=3000 (Compiled frame)
Locked ownable synchronizers:
- None
Thread 25813: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
- org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)
Locked ownable synchronizers:
- None
Thread 25812: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
- org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)
Locked ownable synchronizers:
- None
Thread 25790: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
Thread 25788: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
- org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)
Locked ownable synchronizers:
- None
Thread 25786: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
- org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=747 (Compiled frame)
- org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=789 (Compiled frame)
Locked ownable synchronizers:
- None
Thread 25761: (state = BLOCKED)
- sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
- sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=210 (Compiled frame)
- sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=65 (Compiled frame)
- sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=69 (Compiled frame)
- sun.nio.ch.SelectorImpl.select(long) @bci=30, line=80 (Interpreted frame)
- org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel, int, long) @bci=46, line=332 (Interpreted frame)
- org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) @bci=80, line=157 (Compiled frame)
- org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, line=155 (Compiled frame)
- org.apache.hadoop.net.SocketInputStream.read(byte[], int, int) @bci=7, line=128 (Compiled frame)
- java.io.FilterInputStream.read(byte[], int, int) @bci=7, line=116 (Interpreted frame)
- org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(byte[], int, int) @bci=4, line=364 (Interpreted frame)
- java.io.BufferedInputStream.fill() @bci=175, line=218 (Compiled frame)
- java.io.BufferedInputStream.read() @bci=12, line=237 (Compiled frame)
- java.io.DataInputStream.readInt() @bci=4, line=370 (Compiled frame)
- org.apache.hadoop.ipc.Client$Connection.receiveResponse() @bci=19, line=845 (Compiled frame)
- org.apache.hadoop.ipc.Client$Connection.run() @bci=62, line=790 (Compiled frame)
-------------------------------------------------------------------------------------------------------------------------------
The JobTracker heap summary is as follows:
using thread-local object allocation.
Parallel GC with 10 thread(s)
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 10737418240 (10240.0MB)
NewSize = 1310720 (1.25MB)
MaxNewSize = 17592186044415 MB
OldSize = 5439488 (5.1875MB)
NewRatio = 2
SurvivorRatio = 8
PermSize = 21757952 (20.75MB)
MaxPermSize = 85983232 (82.0MB)
Heap Usage:
PS Young Generation
Eden Space:
capacity = 6488064 (6.1875MB)
used = 6488064 (6.1875MB)
free = 0 (0.0MB)
100.0% used
From Space:
capacity = 9764864 (9.3125MB)
used = 0 (0.0MB)
free = 9764864 (9.3125MB)
0.0% used
To Space:
capacity = 9764864 (9.3125MB)
used = 0 (0.0MB)
free = 9764864 (9.3125MB)
0.0% used
PS Old Generation
capacity = 7158300672 (6826.6875MB)
used = 7158240200 (6826.629829406738MB)
free = 60472 (0.05767059326171875MB)
99.99915521849708% used
PS Perm Generation
capacity = 26738688 (25.5MB)
used = 26428648 (25.204322814941406MB)
free = 310040 (0.29567718505859375MB)
98.8404816272212% used
Please help; this is affecting our production system.
Thanks,
Viswa
> JobTracker memory leak caused by CleanupQueue reopening FileSystem
> ------------------------------------------------------------------
>
> Key: MAPREDUCE-5351
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5351
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 1.1.2
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
> Priority: Critical
> Fix For: 1-win, 1.2.1
>
> Attachments: JobInProgress_JobHistory.patch, MAPREDUCE-5351-1.patch,
> MAPREDUCE-5351-2.patch, MAPREDUCE-5351-addendum-1.patch,
> MAPREDUCE-5351-addendum.patch, MAPREDUCE-5351.patch
>
>
> When a job is completed, closeAllForUGI is called to close all the cached
> FileSystems in the FileSystem cache. However, the CleanupQueue may run after
> this occurs and call FileSystem.get() to delete the staging directory, adding
> a FileSystem to the cache that will never be closed.
> People on the user-list have reported this causing their JobTrackers to OOME
> every two weeks.
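To illustrate the mechanism described in the issue summary above, here is a minimal sketch of the leak path, assuming Hadoop 1.x FileSystem cache semantics. The class and method names introduced here (FileSystemCacheLeakSketch, onJobCompletion, cleanupStagingDir) are illustrative only, not the actual JobTracker or CleanupQueue code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

// Hypothetical sketch of the race: the job-completion path closes the user's
// cached FileSystems, but the asynchronous staging-directory cleanup runs
// later and re-populates the cache with an instance that is never closed.
public class FileSystemCacheLeakSketch {

  // Runs when a job completes: evict and close every FileSystem cached
  // under the job's UserGroupInformation.
  static void onJobCompletion(UserGroupInformation jobUgi) throws IOException {
    FileSystem.closeAllForUGI(jobUgi);
  }

  // Runs later, on the asynchronous cleanup thread. getFileSystem() goes
  // through FileSystem.get(), finds no cached entry (it was just closed),
  // creates a brand-new FileSystem, and inserts it into the static cache.
  static void cleanupStagingDir(Path stagingDir, Configuration conf) throws IOException {
    FileSystem fs = stagingDir.getFileSystem(conf);
    fs.delete(stagingDir, true);
    // Nothing closes this new cache entry, so each completed job pins one
    // FileSystem (plus its cache key) in the JobTracker heap, consistent
    // with the old-generation growth shown in the heap summary above.
  }
}

The attached patches presumably change how the cleanup path obtains or releases its FileSystem so that no unclosed instance is left behind; the sketch only shows why the cached instance is never released in the unpatched code.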
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)