[jira] Commented: (HBASE-2312) Possible data loss when RS goes into GC pause while rolling HLog

HBase Review Board (JIRA) Wed, 28 Jul 2010 14:40:45 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893382#action_12893382
 ]

HBase Review Board commented on HBASE-2312:
-------------------------------------------

Message from: [email protected]

bq.  On 2010-07-28 10:17:14, Todd Lipcon wrote:
bq.  > This looks good, except that it requires the HDFS patches to be in place to even compile. Although I think it's OK to not fix the bug in the case that we're on stock HDFS, I think we still need to be able to run, buggily.
bq.  > 
bq.  > Did we determine that all of the other solutions were flawed/too complicated?
bq.  
bq.  Nicolas wrote:
bq.      HBASE-2312 had some lengthy discussions that ultimately led to this route.  For 0.21 + 0.22, only HADOOP-6840 is needed.  That is a very minor change consisting only of small API additions (so, no regression needed), and Dhruba & I think the lack of risk will make it easy to approve.  0.20-append needs a couple more JIRAs, but those should draw even less flak.  As long as we ship with 0.20-append HDFS or newer, we'll be fine.
bq.      
bq.      I understand that this diff is early and that we will have to wait until we ship with the 0.20-append JAR before applying it.  Basically, I also wanted to show that the API change does fix our issue painlessly.  Any +1s on HADOOP-6840 would be appreciated :)  Is your comment that we should also keep backward support for 0.20.3?
bq.  
bq.  Todd Lipcon wrote:
bq.      Yeah, I think we decided at one point that we should be able to run against a vanilla Apache cluster, just "at your own risk", i.e. the bug fixes wouldn't necessarily work. E.g. this is why we use reflection to check for the syncFs() method and warn when it's not there, but continue to function.
bq.      
bq.      In this patch, it would actually fail to work at all, since the RPC for non-recursive create would get an error at the NN.
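Todd's point about degrading gracefully on stock HDFS can be sketched in Java. This is a minimal illustration under stated assumptions, not HBase's actual code: `AppendCapableStream` and `VanillaStream` are hypothetical stand-ins for output streams with and without a `syncFs()` method, and `findSyncFs`/`sync` are illustrative helper names. The probe uses standard `java.lang.reflect` lookups, logs a warning when the method is missing, and lets the caller keep running without a durable sync.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.logging.Logger;

public class SyncFsProbe {
    private static final Logger LOG = Logger.getLogger(SyncFsProbe.class.getName());

    // Hypothetical stand-in for a stream from an append-capable HDFS build.
    public static class AppendCapableStream {
        int syncs = 0;
        public void syncFs() { syncs++; }
    }

    // Hypothetical stand-in for a stream from stock HDFS with no syncFs().
    public static class VanillaStream { }

    /** Returns the public no-arg syncFs() method if present, else null after logging a warning. */
    static Method findSyncFs(Class<?> streamClass) {
        try {
            return streamClass.getMethod("syncFs");
        } catch (NoSuchMethodException e) {
            LOG.warning(streamClass.getSimpleName()
                    + " has no syncFs(); continuing without durable sync (data loss possible)");
            return null;
        }
    }

    /** Invokes syncFs() when it was found; otherwise does nothing. Returns true if a sync ran. */
    static boolean sync(Object stream, Method syncFs) {
        if (syncFs == null) {
            return false; // "at your own risk" mode: keep running, just without the fix
        }
        try {
            syncFs.invoke(stream);
            return true;
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException("syncFs() failed", e);
        }
    }

    public static void main(String[] args) {
        AppendCapableStream good = new AppendCapableStream();
        System.out.println("append-capable synced: "
                + sync(good, findSyncFs(AppendCapableStream.class)));

        VanillaStream plain = new VanillaStream();
        System.out.println("vanilla synced: "
                + sync(plain, findSyncFs(VanillaStream.class)));
    }
}
```

In practice the `Method` would be looked up once per stream class and cached, so the reflective lookup and the warning are not repeated on every append.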

We'll ship with branch-0.20-append, but yeah, current thinking is that we should be able to run on a Hadoop 0.20.x that does not have a working sync.

We've been known to change our minds.  Start a discussion on the dev list if you want to argue that HBase 0.90.x requires a working sync.

- stack

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/396/#review513
-----------------------------------------------------------

> Possible data loss when RS goes into GC pause while rolling HLog
> ----------------------------------------------------------------
>
>                 Key: HBASE-2312
>                 URL: https://issues.apache.org/jira/browse/HBASE-2312
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 0.20.3
>            Reporter: Karthik Ranganathan
>            Assignee: Nicolas Spiegelberg
>
> There is a corner case in which bad things (i.e. data loss) could happen:
> 1)    RS #1 is going to roll its HLog: it has not yet created the new one, and the old one
> will get no more writes
> 2)    RS #1 enters GC Pause of Death
> 3)    Master lists the HLog files of RS #1 that it has to split, since RS #1 is dead, and
> starts splitting
> 4)    RS #1 wakes up, creates the new HLog (the previous one was rolled) and
> appends an edit, which is lost
> The following seems like a possible solution:
> 1)    Master detects RS#1 is dead
> 2)    The master renames the /hbase/.logs/<regionserver name>  directory to 
> something else (say /hbase/.logs/<regionserver name>-dead)
> 3)    Add mkdir support (as opposed to mkdirs) to HDFS - so that a file 
> create fails if the directory doesn't exist. Dhruba tells me this is very 
> doable.
> 4)    RS #1 comes back up and is not able to create the new HLog. It restarts
> itself.
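The race and the proposed fencing can be simulated on a local filesystem. This is a sketch under loud assumptions: `java.nio.file` stands in for HDFS, `Files.move` with `ATOMIC_MOVE` stands in for the master renaming the region server's log directory, and `Files.createFile`, which does not create missing parent directories, stands in for the non-recursive create that HDFS needed. The class and method names (`LogDirFencing`, `listOnly`, `fenceAndList`, `rollLog`) and the `hlog.N` file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LogDirFencing {

    /** Unfenced master (hypothetical): list whatever logs exist right now. */
    static List<Path> listOnly(Path logDir) throws IOException {
        try (Stream<Path> logs = Files.list(logDir)) {
            return logs.collect(Collectors.toList());
        }
    }

    /** Fenced master (hypothetical): rename the dead server's log dir first, then list it. */
    static List<Path> fenceAndList(Path logDir) throws IOException {
        Path dead = logDir.resolveSibling(logDir.getFileName() + "-dead");
        Files.move(logDir, dead, StandardCopyOption.ATOMIC_MOVE);
        return listOnly(dead);
    }

    /**
     * Region server roll (hypothetical): createFile does not create missing parents,
     * so it fails once the directory has been renamed away. Returns true on success.
     */
    static boolean rollLog(Path logDir, String newLogName) {
        try {
            Files.createFile(logDir.resolve(newLogName));
            return true;
        } catch (IOException e) {
            // In the proposal, the region server would restart itself here.
            return false;
        }
    }

    static boolean containsLog(List<Path> logs, String name) {
        return logs.stream().anyMatch(p -> p.getFileName().toString().equals(name));
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("hbase-logs");

        // Scenario 1: no fencing. Master lists while RS #1 is paused, then RS #1 rolls.
        Path rs1 = Files.createDirectories(root.resolve("rs1"));
        Files.createFile(rs1.resolve("hlog.1"));
        List<Path> toSplit = listOnly(rs1);
        boolean rolled = rollLog(rs1, "hlog.2");
        System.out.println("unfenced: rolled=" + rolled
                + ", hlog.2 in split list=" + containsLog(toSplit, "hlog.2"));

        // Scenario 2: fencing. Master renames the dir before listing, so the roll fails.
        Path rs2 = Files.createDirectories(root.resolve("rs2"));
        Files.createFile(rs2.resolve("hlog.1"));
        List<Path> fencedSplit = fenceAndList(rs2);
        boolean fencedRolled = rollLog(rs2, "hlog.2");
        System.out.println("fenced: rolled=" + fencedRolled
                + ", logs to split=" + fencedSplit.size());
    }
}
```

In the unfenced run the roll succeeds but `hlog.2` is absent from the list the master already took, which is the lost edit; in the fenced run the roll fails, which is the signal for the region server to restart rather than write to a log nobody will split.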

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
