[ https://issues.apache.org/jira/browse/HBASE-4222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087093#comment-13087093 ]
Andrew Purtell commented on HBASE-4222:
---------------------------------------
+1 We've tested this on EC2 clusters and it works.
> Make HLog more resilient to write pipeline failures
> ---------------------------------------------------
>
> Key: HBASE-4222
> URL: https://issues.apache.org/jira/browse/HBASE-4222
> Project: HBase
> Issue Type: Improvement
> Components: wal
> Reporter: Gary Helmling
> Assignee: Gary Helmling
> Fix For: 0.92.0
>
>
> The current implementation of HLog rolling to recover from transient errors
> in the write pipeline seems to have two problems:
> # When {{HLog.LogSyncer}} hits an {{IOException}} during time-based sync
> operations, it requests a log roll in the corresponding catch block, but
> only after escaping from the internal while loop. As a result, the
> {{LogSyncer}} thread exits and, as far as I can tell, is never restarted,
> even if the log roll succeeds.
> # Log rolling requests triggered by an {{IOException}} in {{sync()}} or
> {{append()}} never happen if no entries have yet been written to the log.
> This means that write errors are not recovered from immediately, which
> prolongs the exposure to further errors in the pipeline.
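The first problem above comes down to where the catch block sits relative to the loop. The following is a minimal sketch only, not the HLog code: {{Syncable}}, {{runCatchOutside}}, {{runCatchInside}} and {{failingOnce}} are hypothetical names, and a counter stands in for the log roll request.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class SyncerLoopSketch {
    /** Hypothetical stand-in for the sync operation; not an HBase interface. */
    public interface Syncable {
        void sync() throws IOException;
    }

    /**
     * Catch block outside the loop: the first IOException leaves the loop,
     * so no further syncs are attempted even though a roll was requested.
     * Returns the number of sync attempts made.
     */
    public static int runCatchOutside(Syncable log, int maxIterations, AtomicInteger rollRequests) {
        int attempted = 0;
        try {
            while (attempted < maxIterations) {
                attempted++;
                log.sync();
            }
        } catch (IOException e) {
            rollRequests.incrementAndGet(); // roll requested, but the loop is already over
        }
        return attempted;
    }

    /**
     * Catch block inside the loop: a failure requests a roll and the loop
     * keeps running, so later syncs can succeed against the new writer.
     */
    public static int runCatchInside(Syncable log, int maxIterations, AtomicInteger rollRequests) {
        int attempted = 0;
        while (attempted < maxIterations) {
            attempted++;
            try {
                log.sync();
            } catch (IOException e) {
                rollRequests.incrementAndGet();
            }
        }
        return attempted;
    }

    /** A Syncable that fails on its first call only, simulating a transient error. */
    public static Syncable failingOnce() {
        final boolean[] failed = {false};
        return () -> {
            if (!failed[0]) {
                failed[0] = true;
                throw new IOException("transient pipeline error");
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger rolls = new AtomicInteger();
        System.out.println("catch outside, attempts: " + runCatchOutside(failingOnce(), 5, rolls));
        System.out.println("catch inside, attempts: " + runCatchInside(failingOnce(), 5, rolls));
    }
}
```

With a single transient failure, the first variant stops after one attempt while the second completes all of its iterations, which is the difference between a dead syncer thread and one that survives a successful roll.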
> In addition, it seems like we should be able to better handle transient
> problems, like a rolling restart of DataNodes while the HBase RegionServers
> are running. Currently this will reliably cause RegionServer aborts during
> log rolling: either an append or time-based sync triggers an initial
> {{IOException}}, initiating a log rolling request. However, the log roll
> then fails in closing the current writer ("All datanodes are bad"), causing a
> RegionServer abort. In this case, it seems like we should at least provide
> an option to continue with the new writer and only abort on subsequent errors.
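One way to read the last suggestion in the description is a configurable tolerance for close failures during a roll. This is a rough sketch under that assumption, not the patch itself: {{RollToleranceSketch}}, its {{Writer}} interface and the {{tolerableCloseErrors}} setting are hypothetical names. It also ignores the question of what happens to edits still buffered in the old writer, which a real implementation would have to address.

```java
import java.io.IOException;

public class RollToleranceSketch {
    /** Hypothetical stand-in for a log writer; not an HBase interface. */
    public interface Writer {
        void close() throws IOException;
    }

    private final int tolerableCloseErrors; // hypothetical option, 0 means abort on the first failure
    private int consecutiveCloseErrors = 0;
    private Writer current;

    public RollToleranceSketch(Writer initial, int tolerableCloseErrors) {
        this.current = initial;
        this.tolerableCloseErrors = tolerableCloseErrors;
    }

    /**
     * Switches to the new writer, then tries to close the old one.
     * A close failure is swallowed while the count of consecutive failures
     * stays within the tolerance; beyond that it is rethrown, and the
     * caller would treat it as fatal (for example by aborting).
     */
    public void roll(Writer next) throws IOException {
        Writer old = current;
        current = next; // continue with the new writer regardless of the close outcome
        try {
            old.close();
            consecutiveCloseErrors = 0; // a clean close resets the count
        } catch (IOException e) {
            consecutiveCloseErrors++;
            if (consecutiveCloseErrors > tolerableCloseErrors) {
                throw e;
            }
        }
    }

    public int consecutiveCloseErrors() {
        return consecutiveCloseErrors;
    }
}
```

With a tolerance of 1, a single failed close during a DataNode restart would be absorbed and writes would continue on the new writer; a second consecutive failure would still surface as an error.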