Hi all,
Sorry for the trouble. The problem was caused by a mistake in applying a patch.
Best regards,
Mickey

2013/8/1 Mickey <[email protected]>

> Hi all,
>
> We ran into an extremely long GC pause that brought all of our region
> servers down.
> In the region server's log I found a huge number of entries like these:
>
> 2013-07-30 19:31:12,223 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region TEST_0,8613786870935_425284498467_53049226674,1373103980810.b5bc4c557faf21d71afc79c4e7f91f05., flushing=true, writesEnabled=true
> 2013-07-30 19:31:12,223 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region TEST_0,8613786870935_425284498467_53049226674,1373103980810.b5bc4c557faf21d71afc79c4e7f91f05. due to global heap pressure
> 2013-07-30 19:31:12,223 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=11.2g
>
> (the same three messages repeat over and over, many of them within the
> same millisecond)
>
> It sometimes produces more than 1 GB of logs in one minute.
> Then, after a while, the region server goes into an extremely long GC
> following a promotion failure.
>
> In our environment there was one datanode with a bad network connection
> that kept dropping packets. The region server flushed HFiles very slowly
> whenever it tried to flush to that datanode. After removing the datanode,
> the situation returned to normal.
>
> My guess: in this situation there are too many flushOneForGlobalPressure()
> attempts, and they produce too many Strings (log messages) and other
> short-lived objects.
>
> I checked the MemStoreFlusher code in the HBase 0.94 branch (trunk looks
> the same):
>
>   public void run() {
>     while (!this.server.isStopped()) {
>       FlushQueueEntry fqe = null;
>       try {
>         wakeupPending.set(false); // allow someone to wake us up again
>         fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);
>         if (fqe == null || fqe instanceof WakeupFlushThread) {
>           if (isAboveLowWaterMark()) {
>             LOG.debug("Flush thread woke up because memory above low water=" +
>               StringUtils.humanReadableInt(this.globalMemStoreLimitLowMark));
>             if (!flushOneForGlobalPressure()) {
>               // Wasn't able to flush any region, but we're above low water mark
>               // This is unlikely to happen, but might happen when closing the
>               // entire server - another thread is flushing regions. We'll
>               // just sleep a little bit to avoid spinning, and then pretend
>               // that we flushed one, so anyone blocked will check again
>               lock.lock();
>               try {
>                 Thread.sleep(1000);
>                 flushOccurred.signalAll();
>               } finally {
>                 lock.unlock();
>               }
>             }
>             // Enqueue another one of these tokens so we'll wake up again
>             wakeupFlushThread();
>           }
>           continue;
>         }
>         FlushRegionEntry fre = (FlushRegionEntry)fqe;
>         if (!flushRegion(fre)) {
>           break;
>         }
>       } catch (InterruptedException ex) {
>         continue;
>       } catch (ConcurrentModificationException ex) {
>         continue;
>       } catch (Exception ex) {
>         LOG.error("Cache flusher failed for entry " + fqe, ex);
>         if (!server.checkFileSystem()) {
>           break;
>         }
>       }
>     }
>     ...
>   }
>
> In this situation each pass through the while loop finishes very quickly,
> and the loop immediately runs again.
>
> Is my guess reasonable?
> If so, should we add a short sleep to the loop? Or is there some
> configuration we can set?
> I know that if the region server is quite busy, a sleep may slow down
> insert operations. In my opinion, though, blocking inserts is more
> acceptable than the server going down.
>
> Hope to get your suggestions.
>
> Thanks,
> Mickey
>
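For what it's worth, the short sleep asked about above can be prototyped in isolation. The following is only a minimal, self-contained sketch of the idea, not HBase code: `tryFlushOneForGlobalPressure()` and `runLoop()` are made-up stand-ins that simulate the pathological case where no flush can make progress (e.g. the only candidate region is already mid-flush to a slow datanode).

```java
// Sketch: bounding the flusher's wakeup loop with a short backoff so a
// stuck flush cannot spin and flood the log with allocations.
public class FlushLoopBackoffSketch {

    // Stand-in for flushOneForGlobalPressure(): always reports "no
    // progress", simulating a region stuck flushing to a slow datanode.
    static boolean tryFlushOneForGlobalPressure() {
        return false;
    }

    // Runs the wakeup loop for roughly `millis` ms and returns how many
    // flush attempts were made in that window.
    static int runLoop(long millis) throws InterruptedException {
        long deadline = System.nanoTime() + millis * 1_000_000L;
        int attempts = 0;
        while (System.nanoTime() < deadline) {
            attempts++;
            if (!tryFlushOneForGlobalPressure()) {
                // The short sleep proposed in the mail: without it this
                // loop would iterate millions of times per minute, each
                // iteration allocating log Strings and other garbage.
                Thread.sleep(50);
            }
        }
        return attempts;
    }

    public static void main(String[] args) throws InterruptedException {
        int attempts = runLoop(200);
        // With a 50 ms backoff, a 200 ms window allows only a handful of
        // attempts instead of an unbounded spin.
        System.out.println("attempts=" + attempts);
    }
}
```

Whether a sleep like this belongs on that exact code path in the real MemStoreFlusher is precisely the question posed above; the sketch only shows that a small backoff caps both the attempt rate and the garbage produced per minute.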
