[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Jieshan Bean (JIRA) Thu, 21 Jul 2011 23:01:49 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jieshan Bean updated HBASE-4095:
--------------------------------

    Attachment: hbase-root-regionserver-193-195-5-111.rar


I attached the full logs of the regionserver.
"hbase-root-regionserver-193-195-5-111.rar":
1. One datanode was down, it tricker the rollWriter.
2. Before the patch:
  private void checkLowReplication() {
    // if the number of replicas in HDFS has fallen below the initial
    // value, then roll logs.
    try {
      int numCurrentReplicas = getLogReplication();
      if (numCurrentReplicas != 0 &&
          numCurrentReplicas < this.initialReplication) {
        LOG.warn("HDFS pipeline error detected. " +
            "Found " + numCurrentReplicas + " replicas but expecting " +
            this.initialReplication + " replicas. " +
            " Requesting close of hlog.");
        requestLogRoll();
        logRollRequested = true;
      }
    } catch (Exception e) {
      LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
          " still proceeding ahead...");
    }
  }
 
  It requested to rollWriter, and set logRollRequested as true.
  But the rollWriter may be blocked this time. So the logRollRequested is 
always true until the next success rollWriter finished.
  During the time , client was writing the data into this RegionServer. If Hlog 
was larger than 256M(Suppose it is), it should tricker a new rollWriter, 
  but each time will skip the request, for logRollRequested was still true:  
        public void sync() throws IOException {
        
           -----
       if (!logRollRequested) {
          checkLowReplication();
          if (this.writer.getLength() > this.logrollsize) {
            requestLogRoll();
          }
       }
    }

> Hlog may not be rolled in a long time if checkLowReplication's request of 
> LogRoll is blocked
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>         Attachments: HBASE-4095-90-V3.patch, HBASE-4095-90-v2.patch, 
> HBASE-4095-90.patch, HBASE-4095-trunk-V3.patch, HBASE-4095-trunk-v2.patch, 
> HBASE-4095-trunk.patch, HlogFileIsVeryLarge.gif, 
> hbase-root-regionserver-193-195-5-111.rar, surefire-report-branch.html
>
>
> Some large Hlog files(Larger than 10G) appeared in our environment, and I got 
> the reason why they got so huge:
> 1. The replicas is less than the expect number. So the method of 
> checkLowReplication will be called each sync.
> 2. The method checkLowReplication request log-roll first, and set 
> logRollRequested as true: 
> {noformat}
> private void checkLowReplication() {
> // if the number of replicas in HDFS has fallen below the initial
> // value, then roll logs.
> try {
>   int numCurrentReplicas = getLogReplication();
>   if (numCurrentReplicas != 0 &&
>         numCurrentReplicas < this.initialReplication) {
>       LOG.warn("HDFS pipeline error detected. " +
>               "Found " + numCurrentReplicas + " replicas but expecting " +
>               this.initialReplication + " replicas. " +
>               " Requesting close of hlog.");
>       requestLogRoll();
>       logRollRequested = true;
>   }
> } catch (Exception e) {
>   LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
>         " still proceeding ahead...");
> }
> }
> {noformat}
> 3.requestLogRoll() just commit the roll request. It may not execute in time, 
> for it must got the un-fair lock of cacheFlushLock.
> But the lock may be carried by the cacheflush threads.
> 4.logRollRequested was true until the log-roll executed. So during the time, 
> each request of log-roll in sync() was skipped.
> Here's the logs while the problem happened(Please notice the file size of 
> hlog "193-195-5-111%3A20020.1309937386639" in the last row):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
> HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  
> Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
> Roll 
> /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119,
>  entries=32434, filesize=239589754. New hlog 
> /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: 
> HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas.  
> Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: 
> Renaming flushed file at 
> hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847
>  to 
> hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: 
> Added 
> hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983,
>  entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
> Finished memstore flush of ~207.5m for region 
> Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 
> 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
> Roll 
> /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955,
>  entries=216459, filesize=2370387468. New hlog 
> /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
> Roll 
> /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119,
>  entries=32434, filesize=239589754. New hlog 
> /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: 
> Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: 
> Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: 
> Roll 
> /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639,
>  entries=1135576, filesize=19220372830. New hlog 
> /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4095) Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked

Reply via email to