Dapeng Sun created HDFS-6211:
--------------------------------

             Summary: Thread leak of JobTracker when using OOZIE to submit a job
                 Key: HDFS-6211
                 URL: https://issues.apache.org/jira/browse/HDFS-6211
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: hdfs-client
    Affects Versions: 1.1.2
            Reporter: Dapeng Sun


Scene: When using Oozie to run a Pig script, if fs.default.name is a hostname 
(such as an FQDN) but Oozie specifies an IP address in its configuration, the 
JobTracker will leak threads after many jobs are submitted.

I investigated the issue: the JobTracker uses FileSystem's internal CACHE to 
cache DFSClient instances. The CACHE is a Map<Key, FileSystem>, and the cache 
Key has three members: scheme, server host, and ugi (UserGroupInformation). 
When a client requests an instance, it is looked up in the cache first.
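The cache key described above can be sketched as follows. This is a minimal, 
hypothetical model of FileSystem.Cache.Key (the real class lives in 
org.apache.hadoop.fs.FileSystem; the names and the String stand-in for 
UserGroupInformation here are illustrative only), showing that a hostname URI 
and an IP URI for the same cluster produce different keys:

```java
import java.net.URI;
import java.util.Objects;

public class CacheKeyDemo {
    // Hypothetical sketch of FileSystem.Cache.Key, keyed on
    // (scheme, authority, ugi) as described in the report.
    static final class Key {
        final String scheme;     // e.g. "hdfs"
        final String authority;  // host:port taken from the URI
        final String ugi;        // simplified stand-in for UserGroupInformation

        Key(URI uri, String ugi) {
            this.scheme = uri.getScheme();
            this.authority = uri.getAuthority();
            this.ugi = ugi;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return Objects.equals(scheme, k.scheme)
                && Objects.equals(authority, k.authority)
                && Objects.equals(ugi, k.ugi);
        }

        @Override public int hashCode() {
            return Objects.hash(scheme, authority, ugi);
        }
    }

    public static void main(String[] args) {
        Key byHost = new Key(URI.create("hdfs://namenode.example.com:8020"), "root");
        Key byIp   = new Key(URI.create("hdfs://10.0.0.1:8020"), "root");
        // Same cluster, same user -- but the keys differ, so the cache
        // ends up holding two separate FileSystem/DFSClient instances.
        System.out.println(byHost.equals(byIp)); // false
    }
}
```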

The issue is that the JobTracker crashes from a thread leak after many jobs 
are submitted through Oozie, and the leaked thread is the LeaseChecker. It is 
created by DFSClient and runs until the DFSClient is closed. FileSystem is an 
abstract class, and its implementation DistributedFileSystem holds a DFSClient 
member, so if a cached DFSClient is never closed, its thread leaks.
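The lifecycle problem can be illustrated with a toy client (this is not 
DFSClient itself, just an illustrative sketch): a background "lease checker" 
thread is started on construction and only stops in close(), so any instance 
that is never closed leaves a thread running forever.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LeaseThreadDemo {
    // Counts lease-checker threads that are currently alive.
    static final AtomicInteger RUNNING = new AtomicInteger();

    // Illustrative stand-in for DFSClient and its LeaseChecker thread.
    static final class FakeClient implements AutoCloseable {
        private final Thread leaseChecker;

        FakeClient() {
            leaseChecker = new Thread(() -> {
                RUNNING.incrementAndGet();
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        Thread.sleep(50); // "renew leases" periodically
                    }
                } catch (InterruptedException e) {
                    // interrupted by close(): fall through and exit
                } finally {
                    RUNNING.decrementAndGet();
                }
            });
            leaseChecker.setDaemon(true);
            leaseChecker.start();
        }

        @Override public void close() throws InterruptedException {
            leaseChecker.interrupt();
            leaseChecker.join(); // thread is gone once close() returns
        }
    }

    public static void main(String[] args) throws Exception {
        FakeClient closed = new FakeClient();
        FakeClient leaked = new FakeClient();
        Thread.sleep(200);   // let both threads start
        closed.close();      // this one's thread exits
        // 'leaked' was never closed, so its thread is still running:
        System.out.println(RUNNING.get()); // 1
    }
}
```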

JobClient generates and uploads the job-related files, such as the properties 
file "Job_XXX.xml", to HDFS. In a normal cluster, JobClient reads the 
properties from core-site.xml, so "fs.default.name" is usually the same 
everywhere. But Oozie is special: it is a workflow engine that uses the same 
Hadoop Configuration class, yet it does not read the Hadoop configuration 
files. So every job must specify the HDFS URI and the JobTracker (Resource 
Manager) address, and Oozie puts a property named "fs.default.name" into 
job.xml.

When the JobTracker initializes a job to run, it reads the JobConfiguration 
from Job_XXX.xml, and the "fs.default.name" from core-site.xml is overridden 
by the JobConf; for example, the hostname is replaced by an IP address. Code 
that obtains a DFSClient through the JobConf then looks it up in the CACHE 
under a different key. Most of the cached DFSClients are closed by 
CleanupQueue or closed directly, but the client under the changed (hostname) 
key is never closed, which causes the thread leak.
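The override-and-miss sequence described above can be simulated with a plain 
map standing in for the FileSystem cache (the class and method names here are 
hypothetical, not Hadoop's real ones): the client created under the hostname 
key is orphaned once lookups and cleanup switch to the IP key.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheLeakDemo {
    // Minimal stand-in for a cached DFSClient.
    static final class FakeClient {
        boolean closed = false;
        void close() { closed = true; }
    }

    // Stand-in for FileSystem's CACHE, keyed by authority for brevity.
    static final Map<String, FakeClient> CACHE = new HashMap<>();

    static FakeClient get(String authority) {
        // Cache miss creates a new client, like FileSystem.get(...) does.
        return CACHE.computeIfAbsent(authority, a -> new FakeClient());
    }

    public static void main(String[] args) {
        // Job setup resolves fs.default.name from core-site.xml (hostname):
        FakeClient byHost = get("namenode.example.com:8020");
        // After the JobConf override, the same cluster is addressed by IP,
        // which misses the cache and creates a second client:
        FakeClient byIp = get("10.0.0.1:8020");
        // Cleanup later closes the client it knows about -- the IP one:
        CACHE.remove("10.0.0.1:8020").close();
        System.out.println(byIp.closed);   // true
        System.out.println(byHost.closed); // false -> its thread leaks
    }
}
```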


Solutions:
1. Make the property final. After adding the "final" attribute to 
"fs.default.name", it can no longer be overridden at any time. However, 
making the property final means no subsequent load can change it, which 
affects the whole cluster, and it is not certain that no other component 
needs to change it.
2. Transform the hostname to an IP address in the Key in the JobTracker. 
This needs an HDFS patch: we should resolve the hostname before creating 
the key.
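Solution 2 can be sketched as follows: resolve the host part of the URI to an 
IP address before building the cache key, so the hostname and IP forms of the 
same NameNode map to one key. This is illustrative only (the real fix would 
live in the cache-key construction inside HDFS), and the method name is a 
hypothetical one:

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

public class NormalizeKeyDemo {
    // Resolve the URI's host to an IP address so equivalent URIs
    // produce the same authority string for the cache key.
    static String normalizedAuthority(URI uri) throws UnknownHostException {
        String ip = InetAddress.getByName(uri.getHost()).getHostAddress();
        return uri.getPort() == -1 ? ip : ip + ":" + uri.getPort();
    }

    public static void main(String[] args) throws Exception {
        String a = normalizedAuthority(URI.create("hdfs://localhost:8020"));
        String b = normalizedAuthority(URI.create("hdfs://127.0.0.1:8020"));
        // Both normalize to the same authority, so they would share
        // one cache key and one DFSClient:
        System.out.println(a.equals(b)); // true
    }
}
```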


Here are the example scripts to reproduce the issue:
File:a.pig
A = LOAD '/user/root/oozietest/input.data' USING PigStorage(',') AS (c1:int, 
c2:chararray);
B = GROUP A BY c2;
C = FOREACH B GENERATE group, SUM(A.c1);
STORE C INTO '/user/root/oozietest/output' USING PigStorage(';');

File:input.data
1,aaa
2,bbb
3,ccc
4,aaa

File:workflow.xml
<?xml version="1.0"?>
<workflow-app xmlns="uri:oozie:workflow:0.1" name="OoziepigTest">
  <start to="Step1"/>
  <action name="Step1">
    <pig>
      <job-tracker>#ip-address#:54311</job-tracker>
      <name-node>hdfs://#ip-address#:8020</name-node>
      <script>a.pig</script>
    </pig>
    <ok to="end"/>
    <error to="end"/>
  </action>
  <end name="end"/>
</workflow-app>

File:job.properties
oozie.use.system.libpath=true
oozie.libpath=/user/oozie/share/lib
oozie.wf.application.path=hdfs://#ip-address#:8020/user/${user.name}/oozietest




--
This message was sent by Atlassian JIRA
(v6.2#6252)
