[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156965#comment-13156965 ]

chunhui shen commented on HBASE-4862:
-------------------------------------

@Ted Yu @Todd Lipcon

The splitting and opening happen concurrently in the following case:
1. Move a region from server A to server B (for example, during balancing)
2. Kill server A and server B
3. Restart server A and server B immediately

Before we restart server A and server B, log data for this region appears in both servers' log files.
4. After we restart server B, the ServerShutdownHandler processes this dead server and assigns the region.
5. At the same time, the ServerShutdownHandler processes dead server B and splits server B's HLog.
Because steps 4 and 5 run concurrently, replayRecoveredEditsIfAny() in step 4 and the appending of log entries to this region's recovered.edits file happen at the same time. So when the recovered.edits file is deleted by replayRecoveredEdits, the exception is thrown.

The master and region server logs in this case are as follows:

master log: 
2011-11-16 11:50:13,037 FATAL org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: WriterThread-1 Got while writing log entry to log
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase-common/writetest1/3591e9867a4c125493dc82168854ea0c/recovered.edits/0000000013156791680 File does not exist. [Lease. Holder: DFSClient_hb_m_dw75.kgb.sqa.cm4:60000_1321413286871, pendingcreates: 54]
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1542)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1533)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1449)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:649)
        at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1415)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1411)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1409)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:49)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.writeBuffer(HLogSplitter.java:962)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.doRun(HLogSplitter.java:926)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$WriterThread.run(HLogSplitter.java:898)



regionserver log: 
2011-11-16 11:49:49,727 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Failed delete of hdfs://dw74.kgb.sqa.cm4:9000/hbase-common/writetest1/3591e9867a4c125493dc82168854ea0c/recovered.edits/0000000013156791680
2011-11-16 11:49:49,732 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Deleted recovered.edits file=hdfs://dw74.kgb.sqa.cm4:9000/hbase-common/writetest1/3591e9867a4c125493dc82168854ea0c/recovered.edits/0000000013156800103
                
> Splitting hlog and opening region concurrently may cause data loss
> ------------------------------------------------------------------
>
>                 Key: HBASE-4862
>                 URL: https://issues.apache.org/jira/browse/HBASE-4862
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.2
>            Reporter: chunhui shen
>            Assignee: chunhui shen
>             Fix For: 0.92.0, 0.94.0, 0.90.5
>
>         Attachments: 4862.patch
>
>
> Case Description:
> 1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries to it.
> 2. The regionserver is opening region A now, and in replayRecoveredEditsIfAny() it will delete the file region A/recovered.edits/123456.
> 3. The split-hlog thread catches the IO exception and stops parsing this log file; if skipErrors = true, it adds the file to the corrupt logs. However, data for other regions in this log file will be lost.
> 4. Or, if skipErrors = false, it checks the filesystem. Of course, the filesystem is fine, so it only prints an error log and continues assigning regions. Therefore, data in other log files will also be lost!
> The case may happen as follows:
> 1. Move a region from server A to server B
> 2. Kill server A and server B
> 3. Restart server A and server B
> We could prevent this exception by forbidding deletion of a recovered.edits file that is still being appended to by the split-hlog thread.
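To illustrate the guard suggested at the end of the description, here is a hedged sketch; it is not the attached 4862.patch, and the ".temp" suffix and helper names are assumptions for illustration only. The idea is that the split thread writes each recovered.edits file under a temporary name and renames it only after it has finished appending, so the region-open replay loop can simply skip, and never delete, files that are still in progress.

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecoveredEditsGuard {
  // Assumed naming convention for files the split thread is still writing.
  private static final String IN_PROGRESS_SUFFIX = ".temp";

  /** Split-thread side: the temporary name to open the writer under. */
  public static Path inProgressName(Path finalName) {
    return new Path(finalName.getParent(), finalName.getName() + IN_PROGRESS_SUFFIX);
  }

  /** Split-thread side: publish the file once all entries have been appended and synced. */
  public static void publish(FileSystem fs, Path finalName) throws IOException {
    fs.rename(inProgressName(finalName), finalName);
  }

  /** Region-open side: replay and delete only files the split thread has finished. */
  public static void replayFinishedEdits(FileSystem fs, Path recoveredEditsDir)
      throws IOException {
    for (FileStatus stat : fs.listStatus(recoveredEditsDir)) {
      Path p = stat.getPath();
      if (p.getName().endsWith(IN_PROGRESS_SUFFIX)) {
        continue;  // still being appended to by the split thread: do not replay or delete
      }
      // ... replay the edits in p (as replayRecoveredEditsIfAny does), then remove it ...
      fs.delete(p, false);
    }
  }
}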


