[ https://issues.apache.org/jira/browse/HBASE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13137921#comment-13137921 ]
Nicolas Spiegelberg commented on HBASE-2312:
--------------------------------------------
I have a review up that integrates distributed log splitting. Note that it also
fixes the elusive bug in TestReplication, which was the original reason we
delayed checking this patch in. The problem was that ReplicationSource would
check the live RS folder instead of the dead RS folder to verify that it
should stall and wait for log splitting to finish and move to the OldLogs
directory. JD, could you please verify? I think Prakash has a little more work
to do here for the ProcessServerDeath case, but this is an existing bug. We
should file another JIRA for that and get this one committed :)
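
To make the ReplicationSource fix concrete, here is a minimal sketch of the
check as I read the description above. It is not the code in the patch: the
class, enum, and method names are hypothetical, and plain java.nio paths
stand in for HDFS. The point is only that the lookup has to be made against
the dead RS's log folder, not the folder of the RS doing the recovery.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class RecoveredLogLookup {

    // Hypothetical outcomes for a source that is draining a dead RS's queue.
    enum Action { WAIT_FOR_SPLIT, READ_FROM_OLD_LOGS, NOT_FOUND }

    /**
     * Decide what to do with a log that belonged to a dead region server.
     * deadRsLogDir must be the dead server's folder; passing the live
     * server's folder here is the mistake described above.
     */
    static Action decide(Path deadRsLogDir, Path oldLogsDir, String logName) {
        if (Files.exists(deadRsLogDir.resolve(logName))) {
            // Still in the dead RS folder: splitting has not finished, so stall.
            return Action.WAIT_FOR_SPLIT;
        }
        if (Files.exists(oldLogsDir.resolve(logName))) {
            // Already archived: safe to continue from the old logs directory.
            return Action.READ_FROM_OLD_LOGS;
        }
        return Action.NOT_FOUND;
    }
}
```

With the live RS folder passed in by mistake, the first check never matches,
so the source would not stall while splitting is still in progress.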
> Possible data loss when RS goes into GC pause while rolling HLog
> ----------------------------------------------------------------
>
> Key: HBASE-2312
> URL: https://issues.apache.org/jira/browse/HBASE-2312
> Project: HBase
> Issue Type: Bug
> Components: master, regionserver
> Affects Versions: 0.90.0
> Reporter: Karthik Ranganathan
> Assignee: Nicolas Spiegelberg
> Priority: Critical
> Fix For: 0.92.0
>
> Attachments: D99.1.patch
>
>
> There is a rare corner case in which bad things could happen (i.e. data loss):
> 1) RS #1 is going to roll its HLog - not yet created the new one, old one
> will get no more writes
> 2) RS #1 enters GC Pause of Death
> 3) Master lists HLog files of RS#1 that it has to split as RS#1 is dead,
> starts splitting
> 4) RS #1 wakes up, creates the new HLog (previous one was rolled) and
> appends an edit - which is lost
> The following seems like a possible solution:
> 1) Master detects RS#1 is dead
> 2) The master renames the /hbase/.logs/<regionserver name> directory to
> something else (say /hbase/.logs/<regionserver name>-dead)
> 3) Add mkdir support (as opposed to mkdirs) to HDFS - so that a file
> create fails if the directory doesn't exist. Dhruba tells me this is very
> doable.
> 4) RS#1 comes back up and is not able to create the new hlog. It restarts
> itself.
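
The fencing idea in the proposed solution above can be sketched roughly as
follows. This is an illustration under stated assumptions, not HBase or HDFS
code: the local filesystem (java.nio) stands in for HDFS, and the class and
method names are made up. Files.createFile does not create missing parent
directories, which is the behaviour step 3 asks HDFS to provide.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LogDirFencing {

    /** Master side (step 2): rename the dead server's log directory. */
    static Path fence(Path rsLogDir) throws IOException {
        Path dead = rsLogDir.resolveSibling(rsLogDir.getFileName() + "-dead");
        return Files.move(rsLogDir, dead);
    }

    /**
     * Region server side (steps 3 and 4): try to create the next log file
     * without creating parent directories. Returns false if the directory
     * is gone, at which point the server should restart itself.
     */
    static boolean tryRollLog(Path rsLogDir, String logName) {
        try {
            Files.createFile(rsLogDir.resolve(logName));
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path logs = Files.createTempDirectory("logs");
        Path rs1 = Files.createDirectory(logs.resolve("rs1"));
        System.out.println("roll before fence: " + tryRollLog(rs1, "hlog.1"));
        Path dead = fence(rs1);
        System.out.println("roll after fence: " + tryRollLog(rs1, "hlog.2"));
        System.out.println("old log kept: " + Files.exists(dead.resolve("hlog.1")));
    }
}
```

Once the directory has been renamed, the late create fails instead of quietly
producing a log file that the master never sees, so no edit can be
acknowledged and then lost.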