Dapeng Sun created HDFS-6211:
--------------------------------
Summary: Thread leak of JobTracker when using OOZIE to submit a job
Key: HDFS-6211
URL: https://issues.apache.org/jira/browse/HDFS-6211
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client
Affects Versions: 1.1.2
Reporter: Dapeng Sun
Scenario: When Oozie runs a Pig script and fs.default.name in core-site.xml is a
hostname (e.g. an FQDN) while Oozie's configuration specifies an IP address, the
JobTracker leaks threads after many jobs have been submitted.
I investigated the issue: the JobTracker obtains DFSClient instances through the
CACHE inside FileSystem, a Map<Key, FileSystem>. The cache key has three
members: scheme, server host, and the UserGroupInformation (ugi). When a client
requests an instance, FileSystem consults the cache first.
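The lookup behavior can be illustrated with a minimal, self-contained sketch. The class and host names here are illustrative stand-ins (the real class is FileSystem.Cache.Key in Hadoop), but the equality semantics are the point: two URIs for the same cluster produce different keys if the authority strings differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class CacheKeyDemo {
    // Simplified stand-in for FileSystem.Cache.Key: scheme + authority + ugi.
    static final class Key {
        final String scheme, authority, ugi;
        Key(String scheme, String authority, String ugi) {
            this.scheme = scheme; this.authority = authority; this.ugi = ugi;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return scheme.equals(k.scheme) && authority.equals(k.authority)
                && ugi.equals(k.ugi);
        }
        @Override public int hashCode() { return Objects.hash(scheme, authority, ugi); }
    }

    static final Map<Key, String> CACHE = new HashMap<>();

    public static void main(String[] args) {
        // Hostname and IP are illustrative; same cluster, different authority.
        Key byHost = new Key("hdfs", "namenode.example.com:8020", "root");
        Key byIp   = new Key("hdfs", "10.0.0.5:8020", "root");
        CACHE.put(byHost, "DFSClient#1");
        System.out.println(CACHE.containsKey(byHost)); // true: cache hit
        System.out.println(CACHE.containsKey(byIp));   // false: a second DFSClient gets created
    }
}
```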
The symptom is that the JobTracker crashes from a thread leak after many jobs
are submitted through Oozie. The leaked thread is the LeaseChecker, which is
created by DFSClient and runs until the DFSClient is closed. FileSystem is an
abstract class, and its implementation DistributedFileSystem holds a DFSClient
member, so a cached DFSClient that is never closed leaks its thread.
JobClient generates and uploads the job's related files, such as the properties
file job_XXX.xml, to HDFS. In a normal cluster JobClient reads core-site.xml,
so "fs.default.name" is usually the same everywhere. Oozie, however, is
special: it is a workflow engine that uses the same Hadoop Configuration class
but does not read the Hadoop configuration files, so every job must specify the
HDFS URI and the JobTracker (ResourceManager) address explicitly. Oozie
therefore puts a "fs.default.name" property into job.xml.
When the JobTracker initializes a job, it reads the JobConf from job_XXX.xml,
and the "fs.default.name" from core-site.xml is overridden by the JobConf
value, e.g. the hostname is replaced by an IP address. Code that obtains a
DFSClient via the JobConf then looks up the cache with a different key. Most of
the cached DFSClients are closed by CleanupQueue or closed directly, but the
one under the changed key is never closed, which causes the thread leak.
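The leak mechanism above can be sketched with a toy client whose background thread (standing in for DFSClient's LeaseChecker) runs until close() is called. This is an illustrative model, not Hadoop code: assuming the cache entry under the changed key is never closed, its thread survives indefinitely.

```java
public class LeakSketch {
    // Toy stand-in for DFSClient: spawns a background thread on construction
    // (as DFSClient does for its LeaseChecker) that runs until close().
    static final class ToyClient implements AutoCloseable {
        private volatile boolean running = true;
        private final Thread checker;
        ToyClient(String name) {
            checker = new Thread(() -> {
                while (running) {
                    try { Thread.sleep(10); } catch (InterruptedException e) { return; }
                }
            }, "LeaseChecker-" + name);
            checker.setDaemon(true);
            checker.start();
        }
        boolean isLeaking() { return checker.isAlive(); }
        @Override public void close() throws InterruptedException {
            running = false;
            checker.join();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyClient known  = new ToyClient("hostname-key");
        ToyClient orphan = new ToyClient("ip-key");
        known.close();                          // CleanupQueue closes the known key...
        System.out.println(known.isLeaking());  // false
        System.out.println(orphan.isLeaking()); // true: the orphaned key's thread lives on
    }
}
```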
Solutions:
1. Make the property final. After adding the "final" attribute to
"fs.default.name" in core-site.xml, it cannot be overridden by any later
configuration load. However, this affects the whole cluster, and it is not
certain that no other component needs to change it.
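Solution 1 would be a core-site.xml entry like the following (the hostname is a placeholder; Hadoop skips later assignments to a property whose resource marks it final):

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
  <!-- final: later resources (e.g. job.xml) cannot override this value -->
  <final>true</final>
</property>
```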
2. Transform the hostname to an IP address when building the key in the
JobTracker. This needs an HDFS patch: the host should be canonicalized before
the key is created.
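A minimal sketch of solution 2, resolving the URI's host to its IP before forming the authority part of the key. This is illustrative only (the method name and placement are assumptions; the real fix would live where FileSystem's cache key is constructed):

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

public class CanonicalAuthority {
    // Resolve the URI's host to an IP so that hostname-based and IP-based URIs
    // for the same namenode map to one cache key.
    static String canonicalize(URI uri) {
        try {
            String ip = InetAddress.getByName(uri.getHost()).getHostAddress();
            return ip + ":" + uri.getPort();
        } catch (UnknownHostException e) {
            return uri.getAuthority(); // fall back to the literal authority
        }
    }

    public static void main(String[] args) {
        String a = canonicalize(URI.create("hdfs://localhost:8020"));
        String b = canonicalize(URI.create("hdfs://127.0.0.1:8020"));
        System.out.println(a + " vs " + b);
    }
}
```

With this normalization, a job.xml that carries the IP form and a core-site.xml that carries the hostname form would hit the same cache entry, so the second DFSClient (and its LeaseChecker thread) is never created.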
Here is the example scripts to reproduce the issue:
File:a.pig
A = LOAD '/user/root/oozietest/input.data' USING PigStorage(',') AS (c1:int,
c2:chararray);
B = GROUP A BY c2;
C = FOREACH B GENERATE group, SUM(A.c1);
STORE C INTO '/user/root/oozietest/output' USING PigStorage(';');
File:input.data
1,aaa
2,bbb
3,ccc
4,aaa
File:workflow.xml
<?xml version="1.0"?>
<workflow-app xmlns="uri:oozie:workflow:0.1" name="OoziepigTest">
<start to="Step1"/>
<action name="Step1">
<pig>
<job-tracker>#ip-address#:54311</job-tracker>
<name-node>hdfs://#ip-address#:8020</name-node>
<script>a.pig</script>
</pig>
<ok to="end"/>
<error to="end"/>
</action>
<end name="end"/>
</workflow-app>
File:job.properties
oozie.use.system.libpath=true
oozie.libpath=/user/oozie/share/lib
oozie.wf.application.path=hdfs://#ip-address#:8020/user/${user.name}/oozietest
--
This message was sent by Atlassian JIRA
(v6.2#6252)