[ https://issues.apache.org/jira/browse/HDFS-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thanh Do updated HDFS-1224:
---------------------------

    Description: 
- Summary: If a datanode crashes and restarts, it may miss an append.
 
- Setup:
+ # available datanodes = 3
+ # replica = 3 
+ # disks / datanode = 1
+ # failures = 1
+ failure type = crash
+ When/where failure happens = after the first append succeeds
 
- Details:
Since each datanode maintains a pool of IPC connections, whenever it wants
to make an IPC call it first looks in the pool: if the connection is not
there, it is created and put into the pool; otherwise the existing
connection is reused. Suppose the append pipeline contains dn1, dn2, and
dn3, with dn1 as the primary. After the client appends to block X
successfully, dn2 crashes and restarts. The client then writes a new block
Y to dn1, dn2, and dn3, and the write succeeds. Next, the client starts
appending to block Y. It first calls dn1.recoverBlock(). Dn1 creates a
proxy for each datanode in the pipeline (in order to make RPC calls such as
getMetadataInfo() or updateBlock()). However, because dn2 has just crashed
and restarted, its connection in dn1's pool has become stale. Dn1 uses this
stale connection to make the call to dn2, which fails with an exception.
As a result, the append is made only to dn1 and dn3, even though dn2 is
alive and the write of block Y to dn2 was successful.
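
The root cause is a get-or-create connection cache that never validates
its entries. Below is a minimal sketch of that pattern; the names
(Connection, ConnectionPool, call, evict) are simplified stand-ins for
illustration, not Hadoop's actual IPC classes.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for an IPC connection; not a real Hadoop class.
class Connection {
  static Connection open(InetSocketAddress addr) throws IOException {
    // ... establish the socket and perform the handshake ...
    return new Connection();
  }

  Object call(String method) throws IOException {
    // ... serialize the request, send it, read the response ...
    return null;
  }
}

class ConnectionPool {
  private final Map<InetSocketAddress, Connection> pool =
      new HashMap<InetSocketAddress, Connection>();

  // Get-or-create: return the cached connection if one exists,
  // otherwise open a new one and cache it.
  synchronized Connection get(InetSocketAddress addr) throws IOException {
    Connection conn = pool.get(addr);
    if (conn == null) {
      conn = Connection.open(addr);
      pool.put(addr, conn);
    }
    // No liveness check here: if the remote datanode crashed and
    // restarted after this entry was cached, the connection is stale
    // and the next RPC over it fails -- the failure described above.
    return conn;
  }

  synchronized void evict(InetSocketAddress addr) {
    pool.remove(addr);
  }
}
{code}

One possible mitigation (a sketch only, not necessarily the right fix) is
to evict the cached entry and retry once over a fresh connection before
giving up on the datanode:

{code:java}
// Hypothetical caller-side retry using the sketch above.
static Object callWithRetry(ConnectionPool pool, InetSocketAddress addr,
                            String method) throws IOException {
  try {
    return pool.get(addr).call(method);
  } catch (IOException e) {
    pool.evict(addr);                   // drop the stale entry
    return pool.get(addr).call(method); // retry once on a fresh connection
  }
}
{code}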

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
Haryadi Gunawi (hary...@eecs.berkeley.edu)

> Stale connection makes node miss append
> ---------------------------------------
>
>                 Key: HDFS-1224
>                 URL: https://issues.apache.org/jira/browse/HDFS-1224
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: data-node
>    Affects Versions: 0.20.1
>            Reporter: Thanh Do
>
