[jira] [Commented] (HBASE-9813) Log splitting doesn't prevent RS creating new hlog file

Hadoop QA (JIRA) Mon, 21 Oct 2013 21:13:25 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801453#comment-13801453
 ]


Hadoop QA commented on HBASE-9813:
----------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12609583/trunk-9813.patch
  against trunk revision .

    {color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

    {color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
                        Please justify why no new tests are needed for this 
patch.
                        Also please list what manual steps were performed to 
verify this patch.

    {color:green}+1 hadoop1.0{color}.  The patch compiles against the hadoop 
1.0 profile.

    {color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

    {color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

    {color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

    {color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

    {color:red}-1 site{color}.  The patch appears to cause mvn site goal to 
fail.

     {color:red}-1 core tests{color}.  The patch failed these unit tests:
                       org.apache.hadoop.hbase.regionserver.TestHRegion
                  org.apache.hadoop.hbase.regionserver.wal.TestHLog
                  org.apache.hadoop.hbase.regionserver.TestHRegionBusyWait

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/7597//console

This message is automatically generated.

> Log splitting doesn't prevent RS creating new hlog file
> -------------------------------------------------------
>
>                 Key: HBASE-9813
>                 URL: https://issues.apache.org/jira/browse/HBASE-9813
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jimmy Xiang
>            Assignee: Jimmy Xiang
>         Attachments: nn.log, trunk-9813.patch, trunk-9813.patch
>
>
> Here is something weird happened to my cluster.  When the master recovered, 
> it thought RS was dead by mistake and SSH processed it.  Four hlog files were 
> split and 4 regions were reassigned.
> Later on when the RS tried to report status, it got YouAreDeadException from 
> master.  Master tried to recover it again.  Now SSH split one more hlog file, 
> assigned 0 more extra region.
> Now the question is how come is the extra log file?  I think it caused a data 
> loss (one row) in my case. How can we prevent it?  I can add a checking in 
> the master to make sure a RS is really dead before marking it dead during 
> recovery since it may be slow in reporting to the new master. But this is not 
> a proper fix.  A proper fix should be in the RS side, not creating the extra 
> hlog file.  Is this doable?
> Here is the related master log.  The RS log does not have much thing 
> interested.
> {noformat}
> 2013-10-20 17:40:33,025 INFO  [master:e1119:36000] master.MasterFileSystem: 
> Log folder 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312
>  doesn't belong to a known region server, splitting
> 2013-10-20 17:40:33,432 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> handler.ServerShutdownHandler: Splitting logs for 
> e1521.halxg.cloudera.com,36020,1382314725312 before assignment.
> 2013-10-20 17:40:33,461 DEBUG [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> master.MasterFileSystem: Renamed region directory: 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting
> 2013-10-20 17:40:33,462 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> master.SplitLogManager: dead splitlog workers 
> [e1521.halxg.cloudera.com,36020,1382314725312]
> 2013-10-20 17:40:33,467 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> master.SplitLogManager: started splitting 3 logs in 
> [hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting]
> 2013-10-20 17:40:36,157 DEBUG [main-EventThread] wal.HLogSplitter: Archived 
> processed log 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315954499
>  to 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/oldWALs/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315954499
> 2013-10-20 17:40:37,357 DEBUG [main-EventThread] wal.HLogSplitter: Archived 
> processed log 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315934942
>  to 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/oldWALs/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315934942
> 2013-10-20 17:40:38,889 DEBUG [main-EventThread] wal.HLogSplitter: Archived 
> processed log 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315971393
>  to 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/oldWALs/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315971393
> 2013-10-20 17:40:38,906 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> master.SplitLogManager: finished splitting (more than or equal to) 255322969 
> bytes in 3 log files in 
> [hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting]
>  in 5439ms
> 2013-10-20 17:40:38,907 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> handler.ServerShutdownHandler: Reassigning 4 region(s) that 
> e1521.halxg.cloudera.com,36020,1382314725312 was carrying (and 0 regions(s) 
> that were opening on this server)
> 2013-10-20 17:40:39,269 DEBUG [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> master.DeadServer: Finished processing 
> e1521.halxg.cloudera.com,36020,1382314725312
> 2013-10-20 17:40:39,269 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> handler.ServerShutdownHandler: Finished processing of shutdown of 
> e1521.halxg.cloudera.com,36020,1382314725312
> 2013-10-20 17:40:46,764 DEBUG [FifoRpcScheduler.handler1-thread-11] 
> master.ServerManager: Server REPORT rejected; currently processing 
> e1521.halxg.cloudera.com,36020,1382314725312 as dead server
> 2013-10-20 17:40:46,871 ERROR [FifoRpcScheduler.handler1-thread-13] 
> master.HMaster: Region server e1521.halxg.cloudera.com,36020,1382314725312 
> reported a fatal error:
> ABORTING region server e1521.halxg.cloudera.com,36020,1382314725312: 
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; 
> currently processing e1521.halxg.cloudera.com,36020,1382314725312 as dead 
> server
> org.apache.hadoop.hbase.YouAreDeadException: 
> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; 
> currently processing e1521.halxg.cloudera.com,36020,1382314725312 as dead 
> server
> Caused by: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException):
>  org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; 
> currently processing e1521.halxg.cloudera.com,36020,1382314725312 as dead 
> server
> 2013-10-20 17:40:46,872 ERROR [FifoRpcScheduler.handler1-thread-14] 
> master.HMaster: Region server e1521.halxg.cloudera.com,36020,1382314725312 
> reported a fatal error:
> ABORTING region server e1521.halxg.cloudera.com,36020,1382314725312: IOE in 
> log roller
> java.io.FileNotFoundException: File does not exist: 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382315971393
> 2013-10-20 17:40:48,075 ERROR [FifoRpcScheduler.handler1-thread-15] 
> master.HMaster: Region server e1521.halxg.cloudera.com,36020,1382314725312 
> reported a fatal error:
> ABORTING region server e1521.halxg.cloudera.com,36020,1382314725312: 
> regionserver:36020-0x1419fea35edda89 regionserver:36020-0x1419fea35edda89 
> received expired from ZooKeeper, aborting
> 2013-10-20 17:53:15,272 INFO  [master:e1119:36000] master.MasterFileSystem: 
> Log folder 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312
>  doesn't belong to a known region server, splitting
> 2013-10-20 17:53:15,806 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> handler.ServerShutdownHandler: Splitting logs for 
> e1521.halxg.cloudera.com,36020,1382314725312 before assignment.
> 2013-10-20 17:53:15,825 DEBUG [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> master.MasterFileSystem: Renamed region directory: 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting
> 2013-10-20 17:53:15,825 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> master.SplitLogManager: dead splitlog workers 
> [e1521.halxg.cloudera.com,36020,1382314725312]
> 2013-10-20 17:53:15,831 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> master.SplitLogManager: started splitting 1 logs in 
> [hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting]
> 2013-10-20 17:53:15,998 DEBUG [main-EventThread] wal.HLogSplitter: Archived 
> processed log 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382316046805
>  to 
> hdfs://e1119.halxg.cloudera.com:35802/hbase/oldWALs/e1521.halxg.cloudera.com%2C36020%2C1382314725312.1382316046805
> 2013-10-20 17:53:16,014 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> master.SplitLogManager: finished splitting (more than or equal to) 15 bytes 
> in 1 log files in 
> [hdfs://e1119.halxg.cloudera.com:35802/hbase/WALs/e1521.halxg.cloudera.com,36020,1382314725312-splitting]
>  in 183ms
> 2013-10-20 17:53:16,015 INFO  [MASTER_SERVER_OPERATIONS-e1119:36000-0] 
> handler.ServerShutdownHandler: Reassigning 0 region(s) that 
> e1521.halxg.cloudera.com,36020,1382314725312 was carrying (and 0 regions(s) 
> that were opening on this server)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (HBASE-9813) Log splitting doesn't prevent RS creating new hlog file

Reply via email to