Yu Li created HBASE-20156:
Summary: Allow regionserver to live during HDFS failure
Issue Type: New Feature
Reporter: Yu Li
Currently, if something goes wrong with HDFS, for example NN fencing or the NN
entering safe mode, the RS aborts itself immediately after detecting the
problem (for instance when a log roll or flush fails). On a large cluster with
a heavy write workload, this leaves a huge amount of WAL to split and replay
once HDFS is back, and recovery can take tens of minutes or even hours. We have
experienced this more than once in production; there is always some surprise we
never expected, such as an unstable power supply for the NN.
Here we propose adding an option that lets the RS stay alive during an HDFS
failure instead of aborting. While HDFS is unavailable, the RS would throw
exceptions to clients indicating that it is out of service, and it could
recover right after HDFS is back.
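
To make the proposed behavior concrete, here is a minimal sketch of how such an
option might gate client requests. It is only an illustration under stated
assumptions: HdfsOutageGuard, OutOfServiceException, the surviveHdfsFailure
flag and the hdfsHealthy probe are all hypothetical names, not existing HBase
classes or configuration keys, and a real implementation would have to hook
into the actual log roll and flush failure paths.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: none of these names are existing HBase classes or options.
class HdfsOutageGuard {

    // Hypothetical exception that tells clients the server is temporarily out of service.
    static class OutOfServiceException extends IOException {
        OutOfServiceException(String message) {
            super(message);
        }
    }

    private final boolean surviveHdfsFailure;   // the proposed option (assumed name)
    private final BooleanSupplier hdfsHealthy;  // assumed probe: true once HDFS is usable again
    private final AtomicBoolean outOfService = new AtomicBoolean(false);
    private final AtomicBoolean aborted = new AtomicBoolean(false);

    HdfsOutageGuard(boolean surviveHdfsFailure, BooleanSupplier hdfsHealthy) {
        this.surviveHdfsFailure = surviveHdfsFailure;
        this.hdfsHealthy = hdfsHealthy;
    }

    // Called when an operation such as a log roll or flush fails because of HDFS.
    void onHdfsFailure() {
        if (surviveHdfsFailure) {
            outOfService.set(true);   // proposed behavior: stay alive, stop serving
        } else {
            aborted.set(true);        // current behavior: abort immediately
        }
    }

    // Called periodically (for example by a background task) while out of service.
    void tryRecover() {
        if (outOfService.get() && hdfsHealthy.getAsBoolean()) {
            outOfService.set(false);  // HDFS is back: resume serving without a restart
        }
    }

    // Called at the start of every client request.
    void checkAvailable() throws OutOfServiceException {
        if (outOfService.get()) {
            throw new OutOfServiceException(
                "HDFS is unavailable; region server is out of service");
        }
    }

    boolean isOutOfService() {
        return outOfService.get();
    }

    boolean isAborted() {
        return aborted.get();
    }
}
```

With the flag enabled, a failed flush would call onHdfsFailure(), requests
would be rejected by checkAvailable() until tryRecover() sees a healthy probe,
and no WAL split or replay would be needed because the process never exited.
With the flag disabled, the sketch keeps the current abort behavior.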
This would also make it possible to restart HDFS in some extreme cases, and
would allow us to survive if anything goes wrong during an HDFS upgrade.