[
https://issues.apache.org/jira/browse/HBASE-20156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yu Li reassigned HBASE-20156:
-----------------------------
Assignee: Yu Li
> Allow regionserver to live during HDFS failure
> ----------------------------------------------
>
> Key: HBASE-20156
> URL: https://issues.apache.org/jira/browse/HBASE-20156
> Project: HBase
> Issue Type: New Feature
> Reporter: Yu Li
> Assignee: Yu Li
> Priority: Major
>
> Currently, if something goes wrong with HDFS (for example, NN fencing or the
> NN entering safe mode), the RS aborts itself immediately after detecting it
> (for example, when a log roll or flush fails). On a large cluster with a
> write-heavy workload, there is then a huge amount of WAL to split and replay
> once HDFS is back, and recovery can take tens of minutes or even hours (we
> have experienced this more than once in production; there is always some
> surprise, such as an unstable power supply for the NN, that we never expected).
> Here we propose adding an option that lets the RS keep running during an HDFS
> failure instead of aborting. The RS would throw exceptions to clients
> indicating that it is out of service, and could recover as soon as HDFS is back.
> This would also make it possible to restart HDFS in some extreme cases, and
> would allow us to survive if anything goes wrong during an HDFS upgrade.
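
One way to picture the proposed behaviour is the minimal Java sketch below. It is only an illustration under stated assumptions, not the HBase implementation. The class HdfsOutageGuard, the exception OutOfServiceException, the interface FsOperation and the flag surviveHdfsFailure are all hypothetical names introduced here. Treating every IOException as an HDFS failure is a simplification, and the abort is modelled by an unchecked exception rather than a real process exit.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: none of these names are real HBase APIs.
class HdfsOutageGuard {

    /** Hypothetical exception telling clients the server is temporarily out of service. */
    static class OutOfServiceException extends IOException {
        OutOfServiceException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** A filesystem-backed operation, such as a flush or a log roll. */
    interface FsOperation<T> {
        T run() throws IOException;
    }

    // Stand-in for the proposed option; the real option name is not given in the issue.
    private final boolean surviveHdfsFailure;
    private final AtomicBoolean outOfService = new AtomicBoolean(false);

    HdfsOutageGuard(boolean surviveHdfsFailure) {
        this.surviveHdfsFailure = surviveHdfsFailure;
    }

    boolean isOutOfService() {
        return outOfService.get();
    }

    <T> T execute(FsOperation<T> operation) throws IOException {
        try {
            T result = operation.run();
            // A successful operation means the filesystem is reachable again.
            outOfService.set(false);
            return result;
        } catch (IOException e) {
            if (!surviveHdfsFailure) {
                // Current behaviour: abort (modelled here as an unchecked exception).
                throw new IllegalStateException("Aborting region server", e);
            }
            // Proposed behaviour: stay alive and report the outage to the client.
            outOfService.set(true);
            throw new OutOfServiceException("HDFS unavailable, region server out of service", e);
        }
    }
}
```

In this sketch the server never needs a restart: the first operation that succeeds after HDFS returns clears the out-of-service flag, so there is no WAL to split and replay for a server that stayed up.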
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)