[ 
https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221046#comment-14221046
 ] 

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

Thanks [~cmccabe]  and [~vinayrpet] for the review and further look.

HI Vinayakumar, 

Yes, the testcase Ravi created in HDFS-7342 is a bit artificial, but it helps 
to reproduce the infinite loop case we are dealing with here. Setting the 
leasePeriod  to 0 is just to trigger the lease recovery thus the infinite loop 
in this test.

The steps you described appear to be a different scenario than what we are 
addressing in this jira. The case here is that the lastBlock is already in 
COMPELTE state when the hard lease limit expires. Your case is that the last 
block is not yet in COMPLETE state. When the last block is in  
*UNDER_CONSTRUCTION or UNDER_RECOVERY* state, it's handled differently in 
{{FSNamesystem#internalReleaseLease}}, as you described,  I believe that's why 
you don't see the infinitely loop.

What you described is an interesting case too though. for the failure in step 5 
you described, would you please create a different jira for your case? Thanks.
 
HI [~raviprak], would you please help addressing my earlier comments that Colin 
agreed upon? thanks.


> Namenode LeaseManager checkLeases() runs into infinite loop
> -----------------------------------------------------------
>
>                 Key: HDFS-4882
>                 URL: https://issues.apache.org/jira/browse/HDFS-4882
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, namenode
>    Affects Versions: 2.0.0-alpha, 2.5.1
>            Reporter: Zesheng Wu
>            Assignee: Ravi Prakash
>            Priority: Critical
>         Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, 
> HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.5.patch, 
> HDFS-4882.6.patch, HDFS-4882.patch
>
>
> Scenario:
> 1. cluster with 4 DNs
> 2. the size of the file to be written is a little more than one block
> 3. write the first block to 3 DNs, DN1->DN2->DN3
> 4. all the data packets of first block is successfully acked and the client 
> sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
> 5. DN2 and DN3 are down
> 6. client recovers the pipeline, but no new DN is added to the pipeline 
> because of the current pipeline stage is PIPELINE_CLOSE
> 7. client continuously writes the last block, and try to close the file after 
> written all the data
> 8. NN finds that the penultimate block doesn't has enough replica(our 
> dfs.namenode.replication.min=2), and the client's close runs into indefinite 
> loop(HDFS-2936), and at the same time, NN makes the last block's state to 
> COMPLETE
> 9. shutdown the client
> 10. the file's lease exceeds hard limit
> 11. LeaseManager realizes that and begin to do lease recovery by call 
> fsnamesystem.internalReleaseLease()
> 12. but the last block's state is COMPLETE, and this triggers lease manager's 
> infinite loop and prints massive logs like this:
> {noformat}
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: 
> DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard
>  limit
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. 
>  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=
> /user/h_wuzesheng/test.dat
> 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
> NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block 
> blk_-7028017402720175688_1202597,
> lastBLockState=COMPLETE
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery 
> for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONM
> APREDUCE_-1252656407_1, pendingcreates: 1]
> {noformat}
> (the 3rd line log is a debug log added by us)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to