[ 
https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215345#comment-14215345
 ] 

Yongjun Zhang commented on HDFS-4882:
-------------------------------------

HI Jing,

Thanks for your comments.

There are multiple issues here:

1. The bug in the loop code of the checkLeases method that it doesn't move on 
to next entry
2. The case when penultimate block is COMMITTED and last block is COMPLETE,
3. The case when penultimate block is COMPLETE and the last block is COMMITTED,

Observations about these issues:

Issue 1 is addressed by the uploaded patch so far.

Issue 3 is handled by existing code by throwing an exception (senario#2 in my 
earlier comments), and then the leaseManager goes ahead recover the lease right 
away, even if the last block may not have minimal replica. I raised a concern 
about this earlier: what happens if the minimal replica is 1, which means no 
replica for the last block, and another client acquires the lease and tries to 
APPEND to the file. I mean, where (datanode, replica position) the client 
should write to? 

Issue 2, we can move on and let the penultimate block to finish it self, if it 
finishes meeting the minimal replication requirement, that's nice, if not, it's 
considered a corrupted block. As Colin/Jing suggested.

So it boils down to issue 3, did I misunderstand it? is this a real issue? 

Thanks.




> Namenode LeaseManager checkLeases() runs into infinite loop
> -----------------------------------------------------------
>
>                 Key: HDFS-4882
>                 URL: https://issues.apache.org/jira/browse/HDFS-4882
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, namenode
>    Affects Versions: 2.0.0-alpha, 2.5.1
>            Reporter: Zesheng Wu
>            Assignee: Ravi Prakash
>            Priority: Critical
>         Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, 
> HDFS-4882.2.patch, HDFS-4882.3.patch, HDFS-4882.4.patch, HDFS-4882.patch
>
>
> Scenario:
> 1. cluster with 4 DNs
> 2. the size of the file to be written is a little more than one block
> 3. write the first block to 3 DNs, DN1->DN2->DN3
> 4. all the data packets of first block is successfully acked and the client 
> sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
> 5. DN2 and DN3 are down
> 6. client recovers the pipeline, but no new DN is added to the pipeline 
> because of the current pipeline stage is PIPELINE_CLOSE
> 7. client continuously writes the last block, and try to close the file after 
> written all the data
> 8. NN finds that the penultimate block doesn't has enough replica(our 
> dfs.namenode.replication.min=2), and the client's close runs into indefinite 
> loop(HDFS-2936), and at the same time, NN makes the last block's state to 
> COMPLETE
> 9. shutdown the client
> 10. the file's lease exceeds hard limit
> 11. LeaseManager realizes that and begin to do lease recovery by call 
> fsnamesystem.internalReleaseLease()
> 12. but the last block's state is COMPLETE, and this triggers lease manager's 
> infinite loop and prints massive logs like this:
> {noformat}
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: 
> DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard
>  limit
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. 
>  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=
> /user/h_wuzesheng/test.dat
> 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
> NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block 
> blk_-7028017402720175688_1202597,
> lastBLockState=COMPLETE
> 2013-06-05,17:42:25,695 INFO 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery 
> for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONM
> APREDUCE_-1252656407_1, pendingcreates: 1]
> {noformat}
> (the 3rd line log is a debug log added by us)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to