Thanks a lot for the reply.
I have made several changes here and there, but I do not think they are
quite relevant (the most relevant modification on the control flow of JIP
construction is at the end of JobInProgress$initTasks() where I add all the
map tasks of this job into a data structure in JT I maintained by my
code)....May I describe a case that I think might be a problem and you can
help me identify if it is a bug or I am not using hadoop correctly.
1. In my environment, there is only one user, ec2-user, which is the owner
of JT, the submitter of all the jobs, the owner of HDFS, etc....
2.
(1) job1 submitted and its JIP jip1 constructed successfully
(2) jip1 starts its initTasks() function and inside that function it calls
generateAndStoreTokens() which calls tokenStorage.writeTokenStorageFile(),
which opens a JobToken file, jobToken1, write and not close it yet. This
file is opend with ownership ec2-user
(3) job2 submitted and inside its finally clause of constructor, it calls
FileSystem.closeAllForUGI('ec2-user'), inside which it calls
dfsClient.close()
(4) in version 1.1.0 and before, dfsClient has a leaseChecker object that
maintains all the writing files and their outputstream, dfsClient.close()
calls leaseChecker.close() which pops out the files one by one in a
synchronized block but close the related outputstream outside the block. So
in jip2 constructor, FileSystem.closeAllForUGI('ec2-user'), it gets*jobToken1
* and its outputstream, but not close it yet.
(5) the outputstream from (2) of jip1 close its stream of jobToken1
(6) the outputstream from (4) of jip2 tries to closes its stream of
jobToken1 but fails with a null pointer exception.
3. Even if the leaseChecker object is not used in later version, I still do
not understand why it is safe to call closeAllForUGI('ec2-user') in job2
constructor when job1 could write a jobToken file in initTasks() also with
ownership 'ec2-user'.
Thanks.
Xiao Yu
On Fri, Mar 22, 2013 at 7:30 AM, Harsh J <[email protected]> wrote:
> The current user (UserGroupInformation.getCurrentUser()) is the user
> active in the RPC call thats invoking these functions, and not the JT
> user exactly.
>
> However, given that the JIP construction is outside of a synchronized
> step and can potentially happen in parallel with another JIP request,
> it is possible that you may have identified a possible bug here.
>
> I've not seen this happen though, even at high loads of submits from a
> single user (where I think this could happen). Can you detail your
> changes, cause it could be somewhat related to that as well? The UGI
> compare inside of closeAllForUGI is probably protective enough but
> it'd still be worth looking into.
>
> On Fri, Mar 22, 2013 at 4:12 AM, Xiao Yu <[email protected]> wrote:
> > Hi,
> >
> > This might be a naive question, but I am having a difficult time to
> > understand it. At the end of the constructor of JobInProgress, in the
> > finally clause, the code calls
> > FileSystem.closeAllForUGI(UserGroupInformation.getCurrentUser()), but why
> > is it safe.
> >
> > My concern is that the current user is the owner of jobtracker, so it
> will
> > close all the files the jobtracker is writing, such as a jobtoken file
> > another jip is currently writing.
> >
> > I modified some of the code of hadoop-1.1.0 for my research project and
> saw
> > the following error. It could be some bug in my code, but I suspect it
> is a
> > combined effect of this closeAllForUGI function and perhaps a race
> > condition in the DFSClient$LeaseChecker.close().
> >
> > Could you help me understand why it is safe to call this
> > FileSystem.closeAllForUGI function at the end of the JobInProgress
> > constructor?
> >
> > Thank you very much indeed.
> >
> > ===
> >
> > 2013-03-21 21:04:56,677 INFO org.apache.hadoop.mapred.JobTracker:
> > Initializing job_201303212104_0005
> > 2013-03-21 21:04:56,677 INFO org.apache.hadoop.mapred.JobInProgress:
> > Initializing job_201303212104_0005
> > 2013-03-21 21:04:56,841 INFO org.apache.hadoop.hdfs.DFSClient:
> > read.type=dfs block=blk_700146088908855679_2191
> >
> src=/home/ec2-user/Hadoop/tmp/mapred/staging/ec2-user/.staging/job_201303212104_0006/job.xml
> > 2013-03-21 21:04:56,897 INFO org.apache.hadoop.hdfs.DFSClient:
> ===streamer
> > closed
> here===:org.apache.hadoop.hdfs.DFSClient$DFSOutputStream@239cd5f5dfsclient
> =
> > DFSClient[clientName=DFSClient_-536745704, ugi=ec2-user]
> > leasechecker remove source:
> > /test/out/intrecord/4/_logs/history/job_201303212104_0005_conf.xml
> > 2013-03-21 21:04:56,898 INFO org.apache.hadoop.mapred.JobInProgress:
> > ===generating job token===
> > 2013-03-21 21:04:56,931 INFO org.apache.hadoop.mapred.JobInProgress:
> > job_201303212104_0006: nMaps=1 nReduces=0 max=-1
> > 2013-03-21 21:04:56,932 INFO org.apache.hadoop.mapred.JobInProgress:
> > ===closeAllForUGI here===:ec2-user jobid= job_201303212104_0006
> > *2013-03-21 21:04:56,973 INFO org.apache.hadoop.hdfs.DFSClient:
> ===streamer
> > closed
> here===:org.apache.hadoop.hdfs.DFSClient$DFSOutputStream@6d56d7c8dfsclient
> =
> > DFSClient[clientName=DFSClient_1735333485, ugi=ec2-user]
> > leasechecker remove source:
> > /home/ec2-user/Hadoop/tmp/mapred/system/job_201303212104_0005/jobToken*
> > 2013-03-21 21:04:56,973 INFO org.apache.hadoop.mapred.JobInProgress:
> > jobToken generated and stored with users keys in
> > /home/ec2-user/Hadoop/tmp/mapred/system/job_201303212104_0005/jobToken
> > *2013-03-21 21:04:56,973 INFO org.apache.hadoop.hdfs.DFSClient:
> ===streamer
> > null to
> close===:org.apache.hadoop.hdfs.DFSClient$DFSOutputStream@6d56d7c8dfsclient
> =
> > DFSClient[clientName=DFSClient_1735333485, ugi=ec2-user]
> > leasechecker should remove source:
> > /home/ec2-user/Hadoop/tmp/mapred/system/job_201303212104_0005/jobToken*
> > 2013-03-21 21:04:56,974 ERROR
> > org.apache.hadoop.security.UserGroupInformation:
> PriviledgedActionException
> > as:ec2-user cause:java.io.IOException: java.lang.NullPointerException
> > *2013-03-21 21:04:56,976 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 4 on 9101, call submitJob(job_201303212104_0006, hdfs://
> >
> 10.151.93.195:9100/home/ec2-user/Hadoop/tmp/mapred/staging/ec2-user/.staging/job_201303212104_0006
> ,
> > org.apache.hadoop.security.Credentials@6070c38c) from
> 10.151.93.195:41602:
> > error: java.io.IOException: java.lang.NullPointerException*
> > java.io.IOException: java.lang.NullPointerException
> > at
> org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3799)
> > at
> org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3763)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:616)
> > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
> > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:416)
> > at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
> > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)
> > Caused by: java.lang.NullPointerException
> > at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeThreads(DFSClient.java:4177)
> > at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:4220)
> > at
> >
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:4145)
> > at
> > org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:1475)
> > at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:341)
> > at
> >
> org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:339)
> > at
> org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1499)
> > at
> org.apache.hadoop.fs.FileSystem.closeAllForUGI(FileSystem.java:289)
> > at
> org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:498)
> > at
> org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3797)
> > ... 12 more
> > 2013-03-21 21:04:56,979 INFO org.apache.hadoop.hdfs.DFSClient:
> > read.type=dfs block=blk_972265969568196888_2183
> >
> src=/home/ec2-user/Hadoop/tmp/mapred/staging/ec2-user/.staging/job_201303212104_0005/job.splitmetainfo
> >
> >
> > Xiao Yu
>
>
>
> --
> Harsh J
>