[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881605#action_12881605 ]

Todd Lipcon commented on HDFS-1262:
-----------------------------------

Solving this is a bit tricky... brainstorming a couple of options, not sure which 
is best, or whether I'm missing a simpler one:

Option 1) Change the renewLease call to pass a list of all the files the client 
has open. In LeaseManager, maintain a separate timestamp per file. Then, if we 
end up with an inconsistency like this, it resolves itself, since the lease on 
just that one file will eventually expire. This is somewhat complicated, breaks 
RPC compatibility, etc. Not so great...
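
A rough sketch of what that could look like on the NN side - illustrative names 
only, not the real LeaseManager - where renewLease carries the list of open 
paths instead of just the client name:

{code:java}
// Sketch only: per-path renewal timestamps instead of one timestamp per holder.
import java.util.HashMap;
import java.util.Map;

class PerFileLeases {
  private final long hardLimitMs;
  // holder -> (path -> last time the client reported the path as still open)
  private final Map<String, Map<String, Long>> leases =
      new HashMap<String, Map<String, Long>>();

  PerFileLeases(long hardLimitMs) {
    this.hardLimitMs = hardLimitMs;
  }

  // The client passes the full list of paths it believes it has open.
  synchronized void renewLease(String holder, Iterable<String> openPaths) {
    Map<String, Long> byPath = leases.get(holder);
    if (byPath == null) {
      byPath = new HashMap<String, Long>();
      leases.put(holder, byPath);
    }
    long now = System.currentTimeMillis();
    for (String path : openPaths) {
      byPath.put(path, now);  // only the paths the client still claims get refreshed
    }
  }

  // A path the client stopped reporting quietly ages out on its own.
  synchronized boolean isExpired(String holder, String path) {
    Map<String, Long> byPath = leases.get(holder);
    Long last = (byPath == null) ? null : byPath.get(path);
    return last == null || System.currentTimeMillis() - last > hardLimitMs;
  }
}
{code}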

Option 2) Add a finally clause to the pipeline setup in DFSClient.append() such 
that if the pipeline can't be set up, we re-close the file before rethrowing the 
exception. I don't think we can use completeFile() to do this, since the blocks 
aren't in the blocksMap right after append() - completeFile() would loop forever. 
We could overload abandonBlock() for this purpose, perhaps. Alternatively, maybe 
it's not right that appendFile() removes the block from the blocksMap in the 
first place - maybe the initial commitBlockSynchronization should be the thing 
that does it?
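
The client-side shape of Option 2 would be roughly this (schematic only, using 
a catch for the failure path; the NN types are stubbed out, and abandonAppend() 
is a stand-in for whatever cleanup RPC we'd settle on, e.g. an overloaded 
abandonBlock()):

{code:java}
// Sketch only: undo the NN-side append if pipeline setup fails, so the lease
// doesn't stay attached to a stream the caller never got back.
import java.io.IOException;
import java.io.OutputStream;

interface NameNodeRpc {
  void append(String src, String clientName) throws IOException;        // grants the lease
  void abandonAppend(String src, String clientName) throws IOException; // hypothetical "re-close"
}

class AppendClientSketch {
  private final NameNodeRpc namenode;
  private final String clientName;

  AppendClientSketch(NameNodeRpc namenode, String clientName) {
    this.namenode = namenode;
    this.clientName = clientName;
  }

  OutputStream append(String src) throws IOException {
    namenode.append(src, clientName);   // NN side succeeds: we now hold the lease
    try {
      return setUpPipeline(src);        // datanode pipeline setup can still fail
    } catch (IOException e) {
      try {
        namenode.abandonAppend(src, clientName);  // give the lease back before surfacing the error
      } catch (IOException cleanupFailed) {
        // best effort; the caller should see the original failure
      }
      throw e;
    }
  }

  private OutputStream setUpPipeline(String src) throws IOException {
    // placeholder for the real pipeline setup
    throw new IOException("pipeline setup failed for " + src);
  }
}
{code}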

> Failed pipeline creation during append leaves lease hanging on NN
> -----------------------------------------------------------------
>
>                 Key: HDFS-1262
>                 URL: https://issues.apache.org/jira/browse/HDFS-1262
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Priority: Critical
>             Fix For: 0.20-append
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) The recovery client tried to open the file for append - it looped for a minute 
> or so until the soft lease expired, then the append call initiated recovery
> 3) Recovery completed successfully
> 4) The recovery client called append again, which succeeded on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to the HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try again later
> 6) Some time later, it tried the append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that the client can think it failed to open a file for append while the 
> NN thinks that same client still holds the lease. Since the writer keeps renewing 
> its lease, recovery never happens, and no one can open or recover the file until 
> the DFS client shuts down.
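
For reference, a schematic of the stuck state described above - made-up names, 
not the actual NameNode or DFSClient code: the client-wide lease renewal keeps 
the NN-side timestamp fresh, so the soft limit never expires, while the NN turns 
away a new append from the holder it already has on record.

{code:java}
// Schematic only: why the file stays stuck once the client holds a lease it
// doesn't know about.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class StuckLeaseSketch {
  static final long SOFT_LIMIT_MS = 60 * 1000;
  private final Map<String, String> holderByPath = new HashMap<String, String>();
  private final Map<String, Long> lastRenewal = new HashMap<String, Long>();

  // Steps 4-5: append() grants the lease even though pipeline setup later fails.
  synchronized void startAppend(String path, String client) throws IOException {
    String holder = holderByPath.get(path);
    if (holder != null) {
      if (holder.equals(client)) {
        // roughly the NN's reaction when the current holder re-opens the file
        throw new IOException(path + " already being created by " + client);
      }
      if (System.currentTimeMillis() - lastRenewal.get(holder) < SOFT_LIMIT_MS) {
        throw new IOException("lease on " + path + " held by " + holder);
      }
      // otherwise soft-limit lease recovery would kick in here
    }
    holderByPath.put(path, client);
    lastRenewal.put(client, System.currentTimeMillis());
  }

  // Step 6: the client's renewer thread refreshes everything it holds,
  // including the lease for the file it believes it failed to open.
  synchronized void renewLease(String client) {
    if (lastRenewal.containsKey(client)) {
      lastRenewal.put(client, System.currentTimeMillis());
    }
  }
}
{code}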

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
