[jira] [Commented] (HBASE-13832) Procedure V2: master fail to start due to WALProcedureStore sync failures when HDFS data nodes count is low

Enis Soztutar (JIRA) Wed, 03 Jun 2015 15:13:41 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-13832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571737#comment-14571737
 ]


Enis Soztutar commented on HBASE-13832:
---------------------------------------

I think we should copy the semantics of the FSHLog sync / log roll behavior. 
What we have in FSHLog / LogRoller is this: 
 - The log syncer catches the IOException, logs it, and requests a log roll. 
 - The log roller tries to roll the log, and if it gets an IOException on file 
close, or a generic IOException while rolling, it aborts the RS. 

The reason to keep the same semantics is that we do not want the master to 
abort prematurely on a recoverable IOException like the one in the jira 
title. If the RS can ride over generic IOExceptions, the master should do 
the same.
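
A minimal sketch of those semantics, assuming a simplified single-threaded 
flow. This is not the actual FSHLog, LogRoller, or WALProcedureStore code: 
LogStore, Outcome, store() and syncOrRoll() are hypothetical names made up 
for illustration, and in the real code the syncer and the roller run on 
separate threads.

```java
import java.io.IOException;

public class SyncRollSketch {
    /** Hypothetical stand-in for a WAL-like store; not an HBase interface. */
    public interface LogStore {
        void sync() throws IOException;
        void roll() throws IOException;
    }

    public enum Outcome { SYNCED, ROLLED, ABORTED }

    /**
     * Try to sync. On IOException, log it and try a log roll instead of
     * aborting. Abort only if the roll itself fails with an IOException.
     */
    public static Outcome syncOrRoll(LogStore store) {
        try {
            store.sync();
            return Outcome.SYNCED;
        } catch (IOException syncFailure) {
            // Recoverable case: record the failure and fall through to a roll.
            System.err.println("sync failed, requesting log roll: "
                + syncFailure.getMessage());
        }
        try {
            store.roll();
            return Outcome.ROLLED;
        } catch (IOException rollFailure) {
            // Rolling did not help either, so the caller should abort.
            System.err.println("log roll failed, aborting: "
                + rollFailure.getMessage());
            return Outcome.ABORTED;
        }
    }

    /** Hypothetical test double whose sync and/or roll can be made to fail. */
    public static LogStore store(final boolean syncFails, final boolean rollFails) {
        return new LogStore() {
            @Override
            public void sync() throws IOException {
                if (syncFails) {
                    throw new IOException("simulated sync failure");
                }
            }

            @Override
            public void roll() throws IOException {
                if (rollFails) {
                    throw new IOException("simulated roll failure");
                }
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(syncOrRoll(store(false, false)));
        System.out.println(syncOrRoll(store(true, false)));
        System.out.println(syncOrRoll(store(true, true)));
    }
}
```

The point of the two separate try blocks is that a sync failure alone never 
reaches the abort path; only a failure during the roll does.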

> Procedure V2: master fail to start due to WALProcedureStore sync failures 
> when HDFS data nodes count is low
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13832
>                 URL: https://issues.apache.org/jira/browse/HBASE-13832
>             Project: HBase
>          Issue Type: Sub-task
>          Components: master, proc-v2
>    Affects Versions: 2.0.0, 1.1.0, 1.2.0
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>
> When there are fewer than 3 data nodes, we get a failure in 
> WALProcedureStore#syncLoop() during master start. The failure prevents the 
> master from starting.
> {noformat}
> 2015-05-29 13:27:16,625 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Sync slot failed, abort.
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]], original=[DatanodeInfoWithStorage[10.333.444.555:50010,DS-3c7777ed-93f4-47b6-9c23-1426f7a6acdc,DISK], DatanodeInfoWithStorage[10.222.666.777:50010,DS-f9c983b4-1f10-4d5e-8983-490ece56c772,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>   at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:951)
> {noformat}
> One proposal is to implement logic similar to FSHLog: if an IOException is 
> thrown during syncLoop in WALProcedureStore#start(), instead of aborting 
> immediately, we could try to roll the log and see whether this resolves the 
> issue; if the new log cannot be created, or rolling the log throws further 
> exceptions, we then abort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
