[jira] Commented: (HBASE-2312) Possible data loss when RS goes into GC pause while rolling HLog

Karthik Ranganathan (JIRA) Mon, 15 Mar 2010 14:07:49 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12845526#action_12845526
 ]


Karthik Ranganathan commented on HBASE-2312:
--------------------------------------------

A little confused about your comment. We have the following sequence of actions:

1) Write "intend to roll HLog to new file hlog.N+1" to hlog.N
2) Open hlog.N+1 for append
3) Write "finished rolling" to hlog.N
4) Continue writing to hlog.N+1
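
The four steps above can be sketched as a toy in-memory model. This is only an illustration, not HBase code: the marker payloads, the ToyLogs class, the pause_after_step parameter, and the roll_state helper are all hypothetical names introduced here.

```python
from enum import Enum


class RollState(Enum):
    NOT_ROLLING = "not rolling"
    IN_PROGRESS = "roll in progress"
    FINISHED = "roll finished"


# Hypothetical marker payloads; the real record format is not specified here.
INTENT_MARKER = "intend to roll"
FINISHED_MARKER = "finished rolling"


class ToyLogs:
    """In-memory stand-in for a log directory: file name -> list of records."""

    def __init__(self):
        self.files = {"hlog.0": []}
        self.current = 0

    def name(self, n):
        return f"hlog.{n}"

    def append(self, record):
        self.files[self.name(self.current)].append(record)

    def roll(self, pause_after_step=None):
        """Run the four steps; optionally stop early to mimic a GC pause."""
        old, new = self.name(self.current), self.name(self.current + 1)
        self.files[old].append((INTENT_MARKER, new))    # step 1
        if pause_after_step == 1:
            return
        self.files[new] = []                            # step 2
        if pause_after_step == 2:
            return
        self.files[old].append((FINISHED_MARKER, new))  # step 3
        if pause_after_step == 3:
            return
        self.current += 1                               # step 4: writes go to new file


def roll_state(records):
    """Classify an old log by the last roll marker found in it."""
    state = RollState.NOT_ROLLING
    for rec in records:
        if isinstance(rec, tuple) and rec[0] == INTENT_MARKER:
            state = RollState.IN_PROGRESS
        elif isinstance(rec, tuple) and rec[0] == FINISHED_MARKER:
            state = RollState.FINISHED
    return state
```

Calling roll(pause_after_step=1) leaves hlog.N with an intent marker and no hlog.N+1, which is the "pause before 2" case discussed in this comment.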

If the GC pause hits before 2, no new log file is created. The master will take 
the append lease on hlog.N, and step 3 will fail later. No edits could have gone 
into the new log.
If the GC pause hits after 3, the new log file is the one in effect, so there 
are no issues.
If the GC pause hits after 2 but before 3, the master will always see the last 
log file (hlog.N+1), right? So the master will try to take the append lease on 
hlog.N+1.
  - The master gets the append lease on hlog.N+1, in which case the RS at most 
does step 3 and fails on 4.
  - The master does not get the lease on hlog.N+1 and is still waiting for it, 
in which case the RS logs the edits to hlog.N+1 and then quits. The master does 
not lose the edits.
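
The two sub-cases for a pause between steps 2 and 3 can be illustrated with a toy lease model. This is a sketch under assumptions, not the HDFS lease API: LeasedLog, LeaseLost, and rs_resumes are hypothetical names, and the model assumes an append is rejected as soon as the writer no longer holds the lease.

```python
class LeaseLost(Exception):
    """Raised when a writer appends without holding the lease."""


class LeasedLog:
    """Toy log file with a single append lease holder."""

    def __init__(self, name, holder):
        self.name = name
        self.holder = holder
        self.records = []

    def take_lease(self, new_holder):
        self.holder = new_holder

    def append(self, writer, record):
        if writer != self.holder:
            raise LeaseLost(f"{writer} no longer holds the lease on {self.name}")
        self.records.append(record)


def rs_resumes(log, master_has_lease):
    """RS wakes up after the pause and attempts step 4 on the new log.

    Returns the edits the master finds once it holds the lease and splits.
    """
    if master_has_lease:
        log.take_lease("master")
    try:
        log.append("rs", "edit")
    except LeaseLost:
        pass  # first sub-case: the RS fails on step 4, nothing is written
    log.take_lease("master")  # second sub-case: master gets the lease afterwards
    return list(log.records)
```

In this model the edit is either rejected before it is written or is present in the file the master later splits, which matches the argument above.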

What is the scenario in which the master chases the RS? The only thing I can 
think of is that step 2 takes a long time - but presumably the detection of the 
RS being dead takes longer?

> Possible data loss when RS goes into GC pause while rolling HLog
> ----------------------------------------------------------------
>
>                 Key: HBASE-2312
>                 URL: https://issues.apache.org/jira/browse/HBASE-2312
>             Project: Hadoop HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 0.20.3
>            Reporter: Karthik Ranganathan
>
> There is a rare corner case in which bad things (i.e. data loss) could happen:
> 1)    RS #1 is going to roll its HLog - it has not yet created the new one, 
> and the old one will get no more writes
> 2)    RS #1 enters GC Pause of Death
> 3)    Master lists the HLog files of RS #1 that it has to split as RS #1 is 
> dead, starts splitting
> 4)    RS #1 wakes up, creates the new HLog (previous one was rolled) and 
> appends an edit - which is lost
> The following seems like a possible solution:
> 1)    Master detects RS#1 is dead
> 2)    The master renames the /hbase/.logs/<regionserver name>  directory to 
> something else (say /hbase/.logs/<regionserver name>-dead)
> 3)    Add mkdir support (as opposed to mkdirs) to HDFS - so that a file 
> create fails if the directory doesn't exist. Dhruba tells me this is very 
> doable.
> 4)    RS#1 comes back up and is not able to create the new hlog. It restarts 
> itself.
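
The rename-based fencing proposed in the issue description can be sketched on a local filesystem as a stand-in for HDFS. This is only an analogy: the function names are hypothetical, and the local open(..., "x") call plays the role of the proposed HDFS create that does not create missing parent directories.

```python
import os
import tempfile


def master_fences_logs(log_root, rs_name):
    """Solution step 2: rename the dead server's log directory out of the way."""
    src = os.path.join(log_root, rs_name)
    dst = os.path.join(log_root, rs_name + "-dead")
    os.rename(src, dst)
    return dst


def rs_create_hlog(log_root, rs_name, file_name):
    """Solution steps 3-4: create a log file without creating parent directories.

    open(..., "x") raises FileNotFoundError if the parent directory is gone.
    """
    path = os.path.join(log_root, rs_name, file_name)
    with open(path, "x"):
        pass
    return path


def demo():
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "rs1"))
        rs_create_hlog(root, "rs1", "hlog.0")      # normal operation
        master_fences_logs(root, "rs1")            # master declares rs1 dead
        try:
            rs_create_hlog(root, "rs1", "hlog.1")  # rs1 wakes up and tries to roll
        except FileNotFoundError:
            return "create failed, RS should restart"
        return "create succeeded"
```

If the create instead recreated missing parents (mkdirs-style), the second call would succeed and the new log would land in a directory the master never lists, which is the data-loss case described in the issue.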

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
