[jira] [Commented] (HBASE-7982) TestReplicationQueueFailover* runs for a minute, spews 3/4million lines complaining 'Filesystem closed', has an NPE, and still passes?

Jeffrey Zhong (JIRA) Mon, 04 Mar 2013 23:45:37 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-7982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593149#comment-13593149
 ]


Jeffrey Zhong commented on HBASE-7982:
--------------------------------------

Finally I got the root cause of the huge log files. It's a tricky issue. 

When RS aborts, one LogSyncer doesn't quit after we close lease so it throw 
following exceptions in a while loop:
{code}
2013-02-26 09:55:13,951 FATAL 
[RegionServer:1;juno.apache.org,53545,1361872469173.logSyncer] 
wal.FSHLog(1140): Could not sync. Requesting close of hlog
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
        at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
        at 
org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
        at 
org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
        at 
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
        at 
org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
        at java.lang.Thread.run(Thread.java:722)
2013-02-26 09:55:13,951 ERROR 
[RegionServer:1;juno.apache.org,53545,1361872469173.logSyncer] 
wal.FSHLog$LogSyncer(1012): Error while syncing, requesting close of hlog 
{code}

The reason we left one LogSyncer instance open as you can see in the following 
line:
{code}
  private void closeWAL(final boolean delete) {
    try {
      if (this.hlogForMeta != null) {
        //All hlogs (meta and non-meta) are in the same directory. Don't call 
        //closeAndDelete here since that would delete all hlogs not just the 
        //meta ones. We will just 'close' the hlog for meta here, and leave
        //the directory cleanup to the follow-on closeAndDelete call.
        this.hlogForMeta.close();
      }
      if (this.hlog != null) {
        if (delete) {
          hlog.closeAndDelete();
        } else {
          hlog.close();
        }
      }
    } catch (Throwable e) {
      LOG.error("Close and delete failed", 
RemoteExceptionHandler.checkThrowable(e));
    }
  }
{code}

When {code}this.hlogForMeta.close();{code} throws exception then the hlog won't 
be closed. Below is the log message which shows hlogForMeta.close() throws an 
exception:

{code}
2013-02-26 09:55:13,335 ERROR [IPC Server handler 0 on 56091] 
security.UserGroupInformation(1139): PriviledgedActionException 
as:jenkins.hfs.1 
cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on 
/user/jenkins/hbase/.logs/juno.apache.org,53545,1361872469173/juno.apache.org%2C53545%2C1361872469173.1361872473259.meta
 File does not exist. [Lease.  Holder: 
DFSClient_hb_rs_juno.apache.org,53545,1361872469173_-70821669_153, 
pendingcreates: 2]
2013-02-26 09:55:13,336 ERROR 
[RegionServer:1;juno.apache.org,53545,1361872469173] 
regionserver.HRegionServer(1178): Close and delete failed
{code}

In addition, I also fixed a potential loop situation in the replication source 
code. 
                
> TestReplicationQueueFailover* runs for a minute, spews 3/4million lines 
> complaining 'Filesystem closed', has an NPE, and still passes?
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7982
>                 URL: https://issues.apache.org/jira/browse/HBASE-7982
>             Project: HBase
>          Issue Type: Bug
>          Components: build
>            Reporter: stack
>            Assignee: Jeffrey Zhong
>            Priority: Blocker
>         Attachments: hbase-7982-huge-logging.patch, hbase-7982-NPE.patch
>
>
> I was trying to look at why the odd time Hudson OOMEs trying to make a report 
> on 0.95 build #4 https://builds.apache.org/job/hbase-0.95/4/console:
> {code}
> ERROR: Failed to archive test reports
> hudson.util.IOException2: remote file operation failed: 
> /home/jenkins/jenkins-slave/workspace/hbase-0.95 at 
> hudson.remoting.Channel@151a4e3e:ubuntu3
>       at hudson.FilePath.act(FilePath.java:861)
>       at hudson.FilePath.act(FilePath.java:838)
>       at hudson.tasks.junit.JUnitParser.parse(JUnitParser.java:87)
>       at 
> ...
> Caused by: java.lang.OutOfMemoryError: Java heap space
>       at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
>       at java.nio.CharBuffer.allocate(CharBuffer.java:329)
>       at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
>       at java.nio.charset.Charset.decode(Charset.java:791)
>       at hudson.tasks.junit.SuiteResult.<init>(SuiteResult.java:215)
> ...
> {code}
> We are trying to allocate a big buffer and failing.
> Looking at reports being generated, we have quite a few that are > 10MB in 
> size:
> {code}
> durruti:0.95 stack$ find hbase-* -type f -size +10000k -exec ls -la {} \;
> -rw-r--r--@ 1 stack  staff  11126492 Feb 27 06:14 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.backup.TestHFileArchiving-output.txt
> -rw-r--r--@ 1 stack  staff  13296009 Feb 27 05:47 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestFromClientSide3-output.txt
> -rw-r--r--@ 1 stack  staff  10541898 Feb 27 05:47 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestMultiParallel-output.txt
> -rw-r--r--@ 1 stack  staff  25344601 Feb 27 05:51 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.client.TestRestoreSnapshotFromClient-output.txt
> -rw-r--r--@ 1 stack  staff  17966969 Feb 27 06:12 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.TestEndToEndSplitTransaction-output.txt
> -rw-r--r--@ 1 stack  staff  17699068 Feb 27 06:09 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplit-output.txt
> -rw-r--r--@ 1 stack  staff  17701832 Feb 27 06:07 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.wal.TestHLogSplitCompressed-output.txt
> -rw-r--r--@ 1 stack  staff  717853709 Feb 27 06:17 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailover-output.txt
> -rw-r--r--@ 1 stack  staff  563616793 Feb 27 06:17 
> hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestReplicationQueueFailoverCompressed-output.txt
> {code}
> ... with TestReplicationQueueFailover* being order of magnitude bigger than 
> the others.
> Looking in the test I see both spewing between 800 and 900 thousand lines in 
> about a minute.  Here is their fixation:
> {code}
> 8908998 2013-02-27 06:17:48,176 ERROR 
> [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] 
> wal.FSHLog$LogSyncer(1012): Error while syncing, requesting close of hlog.
> 8908999 java.io.IOException: Filesystem closed
> 8909000 ,...at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
> 8909001 ,...at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
> 8909002 ,...at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
> 8909003 ,...at 
> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> 8909004 ,...at 
> org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
> 8909005 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
> 8909006 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
> 8909007 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
> 8909008 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
> 8909009 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
> 8909010 ,...at java.lang.Thread.run(Thread.java:722)
> 8909011 2013-02-27 06:17:48,176 FATAL 
> [RegionServer:1;hemera.apache.org,35712,1361945801803.logSyncer] 
> wal.FSHLog(1140): Could not sync. Requesting close of hlog
> 8909012 java.io.IOException: Filesystem closed
> 8909013 ,...at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:319)
> 8909014 ,...at org.apache.hadoop.hdfs.DFSClient.access$1200(DFSClient.java:78)
> 8909015 ,...at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3843)
> 8909016 ,...at 
> org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> 8909017 ,...at 
> org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:999)
> 8909018 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:248)
> 8909019 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1120)
> 8909020 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.syncer(FSHLog.java:1058)
> 8909021 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog.sync(FSHLog.java:1228)
> 8909022 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.FSHLog$LogSyncer.run(FSHLog.java:1010)
> 8909023 ,...at java.lang.Thread.run(Thread.java:722)
> ...
> {code}
> These tests are 'succeeding'?
> I also see in both:
> {code}
>    3891 java.lang.NullPointerException
>    3892 ,...at 
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.seek(SequenceFileLogReader.java:261)
>    3893 ,...at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.seek(ReplicationHLogReaderManager.java:103)
>    3894 ,...at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.readAllEntriesToReplicateOrNextFile(ReplicationSource.java:415)
>    3895 ,...at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:333)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-7982) TestReplicationQueueFailover* runs for a minute, spews 3/4million lines complaining 'Filesystem closed', has an NPE, and still passes?

Reply via email to