[ 
https://issues.apache.org/jira/browse/HBASE-6490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nkeywal updated HBASE-6490:
---------------------------

    Description: 
When allocating the datanodes for a new block during a write, hdfs tries up 
to 'dfs.client.block.write.retries' times (default 3) to write the block. 
Each time it fails, it abandons the block, goes back to the namenode for a 
new list of datanodes, and raises an error once the retry limit is reached. 
In HBase, if this error occurs while we're writing an hlog file, it triggers 
a region server abort (as hbase no longer trusts the log). For the simple 
case (a new, and as such empty, log file), this seems to be ok and we don't 
lose data. More complex cases could arise if the error occurs on an hlog 
file with multiple blocks already written.
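
For illustration, the client-side loop behaves roughly like this (a toy 
sketch paraphrasing the observed behavior, not the actual DFSClient code; 
class and method names are made up):

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the allocation loop described above. The comments map the
// steps to the log lines quoted below.
public class BlockAllocationSketch {
  // mirrors dfs.client.block.write.retries (default 3)
  static final int BLOCK_WRITE_RETRIES = 3;

  static String allocateBlock(List<String> datanodes, Set<String> badNodes)
      throws IOException {
    Set<String> excluded = new HashSet<>();
    for (int attempt = 0; attempt <= BLOCK_WRITE_RETRIES; attempt++) {
      // ask the "namenode" for a target, skipping excluded nodes
      for (String node : datanodes) {
        if (excluded.contains(node)) {
          continue;
        }
        if (badNodes.contains(node)) {
          // "Exception in createBlockOutputStream" ... "Abandoning block"
          excluded.add(node); // "Excluding datanode ..."
          break;              // go back for a new list and retry
        }
        return node; // block output stream created successfully
      }
    }
    // retries exhausted: this is the error that makes HBase abort the RS
    throw new IOException("Unable to create new block.");
  }
}
{code}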

The log lines are:
"Exception in createBlockOutputStream", then "Abandoning block " followed by 
"Excluding datanode " for each retry, and finally the IOException "Unable to 
create new block." once the retry limit is reached.

The probability of occurrence seems quite low, roughly (number of bad nodes 
/ number of nodes)^(number of retries): for example, with one bad node out 
of ten and three retries, (1/10)^3 = 0.1%. It also implies that a region 
server is running without its local datanode. But the odds apply to every 
new block.

Increasing the default value of 'dfs.client.block.write.retries' in HBase 
could make sense, so that we are better covered under chaotic conditions.
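
As an illustration, the override could look like this on the HBase client 
side (the value 10 is an arbitrary example, not a tested recommendation; the 
same key can also be set in hdfs-site.xml / hbase-site.xml):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RetriesOverrideSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // raise the per-block retry cap above the hdfs default of 3
    conf.setInt("dfs.client.block.write.retries", 10);
    // the DFS client created from this conf picks up the new cap
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.getUri() + " -> "
        + conf.get("dfs.client.block.write.retries"));
  }
}
{code}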
    Environment: all
    
> 'dfs.client.block.write.retries' value could be increased in HBase
> ------------------------------------------------------------------
>
>                 Key: HBASE-6490
>                 URL: https://issues.apache.org/jira/browse/HBASE-6490
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.96.0
>         Environment: all
>            Reporter: nkeywal
>            Priority: Minor
>
> When allocating the datanodes for a new block during a write, hdfs tries up 
> to 'dfs.client.block.write.retries' times (default 3) to write the block. 
> Each time it fails, it abandons the block, goes back to the namenode for a 
> new list of datanodes, and raises an error once the retry limit is reached. 
> In HBase, if this error occurs while we're writing an hlog file, it triggers 
> a region server abort (as hbase no longer trusts the log). For the simple 
> case (a new, and as such empty, log file), this seems to be ok and we don't 
> lose data. More complex cases could arise if the error occurs on an hlog 
> file with multiple blocks already written.
> The log lines are:
> "Exception in createBlockOutputStream", then "Abandoning block " followed by 
> "Excluding datanode " for each retry, and finally the IOException "Unable to 
> create new block." once the retry limit is reached.
> The probability of occurrence seems quite low, roughly (number of bad nodes 
> / number of nodes)^(number of retries): for example, with one bad node out 
> of ten and three retries, (1/10)^3 = 0.1%. It also implies that a region 
> server is running without its local datanode. But the odds apply to every 
> new block.
> Increasing the default value of 'dfs.client.block.write.retries' in HBase 
> could make sense, so that we are better covered under chaotic conditions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
