[ 
https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882609#action_12882609
 ] 

sam rash commented on HDFS-1262:
--------------------------------

hey, so what should the precise semantics of abandonFile(String src, String 
holder) be?  I have a quick impl now (+ test case) that does this:

1. check that holder owns the lease for src
2. call internalReleaseLeaseOne

so it really is a glorified 'cleanup and close' which has the same behavior as 
if the lease expired--nice and tidy imo.  It does have the slight delay of 
lease recovery, though.
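
To make the two steps above concrete, here is a minimal sketch of the proposed semantics as a toy model. It is not the actual NameNode code: the class AbandonFileSketch, its lease table, and the helpers grantLease, holdsLease and isUnderRecovery are hypothetical names made up for illustration, and internalReleaseLeaseOne is a stand-in that only records that recovery was triggered. Only the abandonFile(String src, String holder) signature and the two-step flow come from the description above.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the proposed abandonFile semantics; not real HDFS code. */
class AbandonFileSketch {
    // Hypothetical stand-in for the NN lease table: path -> lease holder.
    private final Map<String, String> leaseHolderByPath = new HashMap<>();
    // Paths whose lease was released and handed to lease recovery.
    private final Set<String> underRecovery = new HashSet<>();

    /** Hypothetical helper: records that holder has src open for write. */
    void grantLease(String src, String holder) {
        leaseHolderByPath.put(src, holder);
    }

    /** Hypothetical helper: true if holder currently owns the lease on src. */
    boolean holdsLease(String src, String holder) {
        return holder != null && holder.equals(leaseHolderByPath.get(src));
    }

    /** Hypothetical helper: true once release/recovery was triggered for src. */
    boolean isUnderRecovery(String src) {
        return underRecovery.contains(src);
    }

    /**
     * Step 1: check that holder owns the lease for src.
     * Step 2: release the lease the same way an expired lease would be.
     */
    void abandonFile(String src, String holder) throws IOException {
        if (!holdsLease(src, holder)) {
            throw new IOException(
                "abandonFile: " + holder + " does not hold the lease on " + src);
        }
        internalReleaseLeaseOne(src);
    }

    // Stand-in for the real method of the same name; here it only drops the
    // lease and marks the path as under recovery (an assumption of this toy).
    private void internalReleaseLeaseOne(String src) {
        leaseHolderByPath.remove(src);
        underRecovery.add(src);
    }
}
```

In this model a call from a client that is not the lease holder is rejected and leaves the lease untouched, while a call from the holder ends with the path in the same state an expired lease would produce.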


an alternative option:

for the specific case we are fixing here, we could do something simpler, such 
as just putting the targets in the blockMap and calling completeFile (basically 
what commitBlockSynchronization would do).  However, this doesn't handle the 
general case if we expose abandonFile at any other time and a client has 
actually written data to the last block.

I think the first option is safer, but maybe I'm too cautious


if the way I've implemented it seems ok, I can post the patch for review asap

> Failed pipeline creation during append leaves lease hanging on NN
> -----------------------------------------------------------------
>
>                 Key: HDFS-1262
>                 URL: https://issues.apache.org/jira/browse/HDFS-1262
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Assignee: sam rash
>            Priority: Critical
>             Fix For: 0.20-append
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so 
> until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that the client can think it failed to open a file for append when 
> the NN thinks the writer holds a lease. Since the writer keeps renewing its 
> lease, recovery never happens, and no one can open or recover the file until 
> the DFS client shuts down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
