[jira] Commented: (HDFS-1229) DFSClient incorrectly asks for new block if primary crashes during first recoverBlock

2010-07-15 Thread Thanh Do (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889001#action_12889001
 ] 

Thanh Do commented on HDFS-1229:


This does not happen in the append+320 trunk.

 DFSClient incorrectly asks for new block if primary crashes during first 
 recoverBlock
 -

 Key: HDFS-1229
 URL: https://issues.apache.org/jira/browse/HDFS-1229
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs client
Affects Versions: 0.20-append
Reporter: Thanh Do

 Setup:
 
 + # available datanodes = 2
 + # disks / datanode = 1
 + # failures = 1
 + failure type = crash
 + When/where failure happens = during primary's recoverBlock
  
 Details:
 --
 Say the client is appending to block X1 on 2 datanodes: dn1 and dn2.
 First, the client needs to make sure both dn1 and dn2 agree on the new
 generation stamp (GS) of the block.
 1) The client first creates the DFSOutputStream by calling
  
  OutputStream result = new DFSOutputStream(src, buffersize, progress,
      lastBlock, stat,
      conf.getInt("io.bytes.per.checksum", 512));
  
 in DFSClient.append().
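 
 For context, this append path is reached from the public FileSystem API,
 roughly as in the following minimal sketch (the file path is made up):
 
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
 
  public class AppendExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          // FileSystem.append() routes to DFSClient.append(), which builds
          // the DFSOutputStream shown above. The path is hypothetical.
          FSDataOutputStream out = fs.append(new Path("/tmp/append-test"));
          out.write("more bytes".getBytes());
          out.close();
      }
  }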
  
 2) The above DFSOutputStream constructor in turn calls
 processDatanodeError(true, true)
 (i.e., hasError = true, isAppend = true), and starts the DataStreamer:
  
  processDatanodeError(true, true);  /* let's call this PDNE 1 */
  streamer.start();
  
 Note that DataStreamer.run() also calls processDatanodeError():
  while (!closed && clientRunning) {
    ...
    boolean doSleep = processDatanodeError(hasError, false); /* let's call
    this PDNE 2 */
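 
 To make steps 4 and 5 below easier to follow, here is the relevant shape of
 DataStreamer.run(), paraphrased and heavily abbreviated (elisions marked
 "..." as elsewhere in this report); only the two statements that matter for
 this bug are kept:
 
  // Paraphrased, abbreviated shape of DataStreamer.run() (0.20-append);
  // everything irrelevant to this bug is elided.
  public void run() {
    while (!closed && clientRunning) {
      ...
      boolean doSleep = processDatanodeError(hasError, false); // PDNE 2
      ...
      if (blockStream == null) {
        // asks the namenode for a brand-new block -- wrong for an append
        nodes = nextBlockOutputStream(src);
        ...
      }
      ...
    }
  }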
  
 3) Now in PDNE 1, we have the following code:
  
  blockStream = null;
  blockReplyStream = null;
  ...
  while (!success && clientRunning) {
    ...
    try {
      primary = createClientDatanodeProtocolProxy(primaryNode, conf);
      newBlock = primary.recoverBlock(block, isAppend, newnodes);
      /* exception here */
      ...
    } catch (IOException e) {
      ...
      if (recoveryErrorCount > maxRecoveryErrorCount) {
        // this condition is false
      }
      ...
      return true;
    } // end catch
    finally {...}
  
    this.hasError = false;
    lastException = null;
    errorIndex = 0;
    success = createBlockOutputStream(nodes, clientName, true);
  }
  ...
  
 Because dn1 crashes during the client's call to recoverBlock, an exception
 is thrown. Hence we go to the catch block, in which processDatanodeError
 returns true before hasError is set to false. Also, because
 createBlockOutputStream() is not called (due to the early return),
 blockStream is still null.
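  
 The control-flow flaw is easier to see in isolation. The following
 self-contained toy (not Hadoop code; every name here is hypothetical)
 reproduces the pattern: an early return from the error handler skips the
 cleanup that would have rebuilt the stream, so the caller's "no stream yet"
 branch later misfires:
 
  // Self-contained toy of the PDNE control flow; all names hypothetical.
  public class EarlyReturnDemo {
      static boolean hasError = false;  // the field stays false; PDNE 1 got
                                        // its 'true' only as a parameter
      static Object blockStream = null; // null until the stream is rebuilt
 
      // Mirrors processDatanodeError: on failure it returns *before* the
      // code that would have rebuilt blockStream.
      static boolean processDatanodeError(boolean error, boolean isAppend) {
          if (!error) {
              return false;             // the PDNE 2 path in step 4 below
          }
          try {
              throw new java.io.IOException("primary crashed in recoverBlock");
          } catch (java.io.IOException e) {
              return true;              // early return: blockStream stays null
          }
          // unreachable: hasError = false; blockStream = new Object();
      }
 
      public static void main(String[] args) {
          processDatanodeError(true, true);      // PDNE 1: fails, returns early
          processDatanodeError(hasError, false); // PDNE 2: hasError false, no-op
          if (blockStream == null) {
              // The streamer now "recovers" by asking the namenode for a
              // brand-new block, even though this is an append.
              System.out.println("would call nextBlockOutputStream() -- the bug");
          }
      }
  }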
  
 4) Now that PDNE 1 has finished, we come to streamer.start(), which starts
 DataStreamer.run() and hence calls PDNE 2. Because hasError = false, PDNE 2
 returns false immediately, without doing anything:
  if (!hasError) { return false; }
  
 5) Still in DataStreamer.run(), after PDNE 2 returns false, we still have
 blockStream = null, hence the following code is executed:
  if (blockStream == null) {
    nodes = nextBlockOutputStream(src);
    this.setName("DataStreamer for file " + src + " block " + block);
    response = new ResponseProcessor(nodes);
    response.start();
  }
  
 nextBlockOutputStream, which asks the namenode to allocate a new block, is
 called. (This is not good, because we are appending, not writing a new
 block.) The namenode gives the client a new block ID and a set of
 datanodes, including the crashed dn1. This makes createBlockOutputStream()
 fail, because it tries to contact dn1 (which has crashed) first. The client
 retries 5 times without any success, because every time it asks the
 namenode for a new block! Again we see that the retry logic at the client
 is weird!
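  
 One way to read the bug is that the catch block gives up its error state
 too early. As a sketch of one possible direction only (this is not the
 committed 0.20-append patch), the recovery-failure path could keep the
 streamer in error mode, so that PDNE 2 retries recoverBlock instead of
 letting run() fall into the new-block branch:
 
  // Sketch of the idea only -- not the actual patch for this issue:
  } catch (IOException e) {
    ...
    this.hasError = true;  // stay in error-recovery mode, so that
    return true;           // PDNE 2 retries recoverBlock() instead of
  }                        // run() calling nextBlockOutputStream()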
 *This bug was found by our Failure Testing Service framework:
 http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
 For questions, please email us: Thanh Do (than...@cs.wisc.edu) and 
 Haryadi Gunawi (hary...@eecs.berkeley.edu)*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-1229) DFSClient incorrectly asks for new block if primary crashes during first recoverBlock

2010-06-22 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881540#action_12881540
 ] 

Todd Lipcon commented on HDFS-1229:
---

Hi Thanh,

Can you please make a JUnit test for this against branch-0.20-append? The tests 
in TestFileAppend4 should be a good model.

-Todd
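
A skeleton for such a test, in the spirit of TestFileAppend4, might look like
the following. The hard part, crashing the primary exactly during the first
recoverBlock, is only marked as a comment, since TestFileAppend4 does that
with fault-injection hooks; the class and file names here are made up.

 import junit.framework.TestCase;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.hdfs.MiniDFSCluster;

 // Skeleton only, modeled loosely on TestFileAppend4.
 public class TestAppendRecoverBlockCrash extends TestCase {
     public void testPrimaryCrashDuringFirstRecoverBlock() throws Exception {
         Configuration conf = new Configuration();
         MiniDFSCluster cluster = new MiniDFSCluster(conf, 2, true, null);
         try {
             FileSystem fs = cluster.getFileSystem();
             Path p = new Path("/testAppendRecoverBlock");
             FSDataOutputStream out = fs.create(p);
             out.write("first half".getBytes());
             out.close();

             // TODO: arrange for the primary datanode to crash inside
             // recoverBlock (e.g. with the fault-injection hooks that
             // TestFileAppend4 uses), then:
             out = fs.append(p);
             out.write("second half".getBytes());
             out.close();
             // The append should recover on the surviving datanode without
             // asking the namenode for a new block.
         } finally {
             cluster.shutdown();
         }
     }
 }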
