[ https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881748#action_12881748 ]

sam rash commented on HDFS-1262:
--------------------------------

I think something along the lines of option 2 sounds cleaner.

But I have another question: does the error you see include
"because current leaseholder is trying to recreate file"?

It sounds like this code is executing:

{code}
        //
        // We found the lease for this file. And surprisingly the original
        // holder is trying to recreate this file. This should never occur.
        //
        if (lease != null) {
          Lease leaseFile = leaseManager.getLeaseByPath(src);
          if (leaseFile != null && leaseFile.equals(lease)) {
            throw new AlreadyBeingCreatedException(
                "failed to create file " + src + " for " + holder +
                " on client " + clientMachine +
                " because current leaseholder is trying to recreate file.");
          }
        }
{code}

And any time I see a comment like "this should never happen", it sounds to me
like the handling of that case might be suboptimal.  Is there any reason that a
client shouldn't be able to open a file in the same mode in which it already
has it open?  NN-side, it's basically a no-op, or an explicit lease renewal.

Any reason we can't make the above code do that?  (log something and return)
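
For illustration, something like this (an untested sketch; the exact log
message and the renewLease(Lease) overload are assumptions on my part):

{code}
        if (lease != null) {
          Lease leaseFile = leaseManager.getLeaseByPath(src);
          if (leaseFile != null && leaseFile.equals(lease)) {
            // Same holder re-opening a file it already has open: log it and
            // treat it as an implicit lease renewal instead of throwing.
            NameNode.stateChangeLog.info("DIR* NameSystem.startFile: holder "
                + holder + " re-opened " + src + " on client " + clientMachine
                + "; renewing lease and returning");
            leaseManager.renewLease(lease);  // assumes a renewLease(Lease) overload
            return;
          }
        }
{code}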


> Failed pipeline creation during append leaves lease hanging on NN
> -----------------------------------------------------------------
>
>                 Key: HDFS-1262
>                 URL: https://issues.apache.org/jira/browse/HDFS-1262
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Priority: Critical
>             Fix For: 0.20-append
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so 
> until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that a client can think it failed to open a file for append while the 
> NN thinks that writer still holds the lease. Since the writer keeps renewing 
> its lease, recovery never happens, and no one can open or recover the file 
> until the DFS client shuts down.
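
To make the deadlock in step 6 concrete, a hypothetical client-side view
(names and structure are illustrative, not the actual HBase code):

{code}
// Hypothetical recovery-client retry loop illustrating step 6.
FSDataOutputStream reopenForAppend(FileSystem fs, Path path)
    throws InterruptedException {
  while (true) {
    try {
      // The NN rejects this with "current leaseholder is trying to recreate
      // file" (wrapped in a RemoteException), because the same DFSClient
      // still holds, and keeps renewing, the lease.
      return fs.append(path);
    } catch (IOException e) {
      Thread.sleep(1000);  // waiting never helps: soft-lease expiry never fires
    }
  }
}
{code}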
