[
https://issues.apache.org/jira/browse/HBASE-10499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ramkrishna.s.vasudevan updated HBASE-10499:
-------------------------------------------
Attachment: HBASE-10499_v5.patch
Going through the comments again and the excellent analysis by [~fenghh] and
[~johnking0099], I can see that the log that comes out of global heap pressure
{code}
2014-02-11 08:44:30,580 INFO [Thread-18] regionserver.MemStoreFlusher: Flush
of region usertable,user1854,1392107806972.e001ae4de867efc4c4c092e7e5a204e1.
due to global heap pressure
{code}
does not appear for the problematic regions. So this confirms that we have not
entered this removal from the FQE for these regions:
{code}
if (fqe != null && emergencyFlush) {
  // Need to remove the region from the delay queue. When NOT an
  // emergencyFlush, the item was removed via a flushQueue.poll.
  flushQueue.remove(fqe);
}
{code}
which means they had been removed from the FQE but not from the regionQueue, in
such a way that the FQE entry was wrongly mapped to an element that was not
removed from the regionQueue. So if region B had to be flushed and was removed
from the regionQueue, region A's entry got removed from the FQE instead.
So the delay is calculated as follows:
{code}
@Override
public long getDelay(TimeUnit unit) {
  return unit.convert(this.whenToExpire - EnvironmentEdgeManager.currentTime(),
      TimeUnit.MILLISECONDS);
}
{code}
where whenToExpire is updated in two places: at creation of the FRE (which is
actually synchronized with the regionQueue) and in requeue(). So ideally we
would not have two regions with the same whenToExpire (assuming there is no
precision issue). The requeue() could update the whenToExpire to the same ts,
but here that has not happened.
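To make the mix-up concrete, here is a small standalone sketch (the Entry class and field names are made up for illustration; this is NOT the HBase FlushRegionEntry): DelayQueue.remove() matches entries via equals(), so if two distinct entries compare equal, e.g. because they share the same whenToExpire, the queue can drop a different region's entry than the one passed in, which is exactly the region A / region B situation described above.
{code}
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class WrongRemovalDemo {
  // Hypothetical stand-in for a flush queue entry; NOT the HBase FlushRegionEntry.
  static class Entry implements Delayed {
    final String region;
    final long whenToExpire;

    Entry(String region, long whenToExpire) {
      this.region = region;
      this.whenToExpire = whenToExpire;
    }

    @Override
    public long getDelay(TimeUnit unit) {
      return unit.convert(whenToExpire - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
      return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }

    // Equality based only on the expiry time: two different regions queued with
    // the same whenToExpire become "equal" as far as remove() is concerned.
    @Override
    public boolean equals(Object o) {
      return o instanceof Entry && ((Entry) o).whenToExpire == this.whenToExpire;
    }

    @Override
    public int hashCode() {
      return Long.hashCode(whenToExpire);
    }
  }

  public static void main(String[] args) {
    long ts = System.currentTimeMillis() + 10_000;
    Entry regionA = new Entry("regionA", ts);
    Entry regionB = new Entry("regionB", ts);

    DelayQueue<Entry> flushQueue = new DelayQueue<>();
    flushQueue.add(regionA);
    flushQueue.add(regionB);

    // We intend to drop region B's entry, but remove() matches by equals(), so
    // region A's entry may be the one that actually leaves the queue, and then
    // this prints "regionB".
    flushQueue.remove(regionB);
    System.out.println("Still queued: " + flushQueue.peek().region);
  }
}
{code}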
If we are going to change the implementation of equals() in the FQE, I would
suggest comparing whether the region is the same, and then going for the
getDelay() check in compareTo().
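Roughly what that suggestion could look like (an illustrative shape only, not the actual v5 patch): equals() keyed on the region so DelayQueue.remove() drops exactly the intended region's entry, while compareTo() keeps the getDelay() ordering and only uses the region as a tie-breaker.
{code}
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Illustrative entry shape; field names are assumptions, not the HBase FlushRegionEntry.
class RegionKeyedEntry implements Delayed {
  final String encodedRegionName;
  final long whenToExpire;

  RegionKeyedEntry(String encodedRegionName, long whenToExpire) {
    this.encodedRegionName = encodedRegionName;
    this.whenToExpire = whenToExpire;
  }

  @Override
  public long getDelay(TimeUnit unit) {
    return unit.convert(whenToExpire - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
  }

  @Override
  public int compareTo(Delayed other) {
    // Primary ordering stays on the remaining delay, as in the existing getDelay()-based code.
    int cmp = Long.compare(getDelay(TimeUnit.MILLISECONDS),
        other.getDelay(TimeUnit.MILLISECONDS));
    if (cmp != 0 || !(other instanceof RegionKeyedEntry)) {
      return cmp;
    }
    // Tie-break on the region so equal delays do not make distinct regions interchangeable.
    return encodedRegionName.compareTo(((RegionKeyedEntry) other).encodedRegionName);
  }

  @Override
  public boolean equals(Object o) {
    // Equality by region: removing region B's entry can never take out region A's.
    return o instanceof RegionKeyedEntry
        && encodedRegionName.equals(((RegionKeyedEntry) o).encodedRegionName);
  }

  @Override
  public int hashCode() {
    return encodedRegionName.hashCode();
  }
}
{code}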
> In write heavy scenario one of the regions does not get flushed causing
> RegionTooBusyException
> ----------------------------------------------------------------------------------------------
>
> Key: HBASE-10499
> URL: https://issues.apache.org/jira/browse/HBASE-10499
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.0
> Reporter: ramkrishna.s.vasudevan
> Assignee: ramkrishna.s.vasudevan
> Priority: Critical
> Fix For: 2.0.0, 1.1.0
>
> Attachments: 10499-v2.txt, 10499-v3.txt, 10499-v4.txt,
> HBASE-10499.patch, HBASE-10499_v5.patch, compaction-queue.png,
> hbase-root-master-ip-10-157-0-229.zip,
> hbase-root-regionserver-ip-10-93-128-92.zip, master_4e39.log,
> master_576f.log, rs_4e39.log, rs_576f.log, t1.dump, t2.dump,
> workloada_0.98.dat
>
>
> I got this while testing the 0.98 RC, but I am not sure if it is specific to this
> version; it doesn't seem so to me.
> It is also similar to HBASE-5312 and HBASE-5568.
> Using 10 threads I do writes to 4 RSs using YCSB. The table created has 200
> regions. In one of the runs with a 0.98 server and a 0.98 client I hit this
> problem: the number of hlogs kept growing and the system requested flushes for
> that many regions.
> One by one everything was flushed except one region, which remained
> unflushed. The ripple effect of this on the client side:
> {code}
> com.yahoo.ycsb.DBException:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
> 54 actions: RegionTooBusyException: 54 times,
> at com.yahoo.ycsb.db.HBaseClient.cleanup(HBaseClient.java:245)
> at com.yahoo.ycsb.DBWrapper.cleanup(DBWrapper.java:73)
> at com.yahoo.ycsb.ClientThread.run(Client.java:307)
> Caused by:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed
> 54 actions: RegionTooBusyException: 54 times,
> at
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:187)
> at
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$500(AsyncProcess.java:171)
> at
> org.apache.hadoop.hbase.client.AsyncProcess.getErrors(AsyncProcess.java:897)
> at
> org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:961)
> at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1225)
> at com.yahoo.ycsb.db.HBaseClient.cleanup(HBaseClient.java:232)
> ... 2 more
> {code}
> On one of the RS
> {code}
> 2014-02-11 08:45:58,714 INFO [regionserver60020.logRoller] wal.FSHLog: Too
> many hlogs: logs=38, maxlogs=32; forcing flush of 23 regions(s):
> 97d8ae2f78910cc5ded5fbb1ddad8492, d396b8a1da05c871edcb68a15608fdf2,
> 01a68742a1be3a9705d574ad68fec1d7, 1250381046301e7465b6cf398759378e,
> 127c133f47d0419bd5ab66675aff76d4, 9f01c5d25ddc6675f750968873721253,
> 29c055b5690839c2fa357cd8e871741e, ca4e33e3eb0d5f8314ff9a870fc43463,
> acfc6ae756e193b58d956cb71ccf0aa3, 187ea304069bc2a3c825bc10a59c7e84,
> 0ea411edc32d5c924d04bf126fa52d1e, e2f9331fc7208b1b230a24045f3c869e,
> d9309ca864055eddf766a330352efc7a, 1a71bdf457288d449050141b5ff00c69,
> 0ba9089db28e977f86a27f90bbab9717, fdbb3242d3b673bbe4790a47bc30576f,
> bbadaa1f0e62d8a8650080b824187850, b1a5de30d8603bd5d9022e09c574501b,
> cc6a9fabe44347ed65e7c325faa72030, 313b17dbff2497f5041b57fe13fa651e,
> 6b788c498503ddd3e1433a4cd3fb4e39, 3d71274fe4f815882e9626e1cfa050d1,
> acc43e4b42c1a041078774f4f20a3ff5
> ......................................................
> 2014-02-11 08:47:49,580 INFO [regionserver60020.logRoller] wal.FSHLog: Too
> many hlogs: logs=53, maxlogs=32; forcing flush of 2 regions(s):
> fdbb3242d3b673bbe4790a47bc30576f, 6b788c498503ddd3e1433a4cd3fb4e39
> {code}
> {code}
> 2014-02-11 09:42:44,237 INFO [regionserver60020.periodicFlusher]
> regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
> flush for region
> usertable,user3654,1392107806977.fdbb3242d3b673bbe4790a47bc30576f. after a
> delay of 16689
> 2014-02-11 09:42:44,237 INFO [regionserver60020.periodicFlusher]
> regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
> flush for region
> usertable,user6264,1392107806983.6b788c498503ddd3e1433a4cd3fb4e39. after a
> delay of 15868
> 2014-02-11 09:42:54,238 INFO [regionserver60020.periodicFlusher]
> regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
> flush for region
> usertable,user3654,1392107806977.fdbb3242d3b673bbe4790a47bc30576f. after a
> delay of 20847
> 2014-02-11 09:42:54,238 INFO [regionserver60020.periodicFlusher]
> regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
> flush for region
> usertable,user6264,1392107806983.6b788c498503ddd3e1433a4cd3fb4e39. after a
> delay of 20099
> 2014-02-11 09:43:04,238 INFO [regionserver60020.periodicFlusher]
> regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
> flush for region
> usertable,user3654,1392107806977.fdbb3242d3b673bbe4790a47bc30576f. after a
> delay of 8677
> {code}
> {code}
> 2014-02-11 10:31:21,020 INFO [regionserver60020.logRoller] wal.FSHLog: Too
> many hlogs: logs=54, maxlogs=32; forcing flush of 1 regions(s):
> fdbb3242d3b673bbe4790a47bc30576f
> {code}
> I restarted another RS and there were region movements of other regions, but
> this region stayed with the RS that has this issue. One important observation
> is that in HRegion.internalFlushcache() we need to add a debug log here
> {code}
> // If nothing to flush, return and avoid logging start/stop flush.
> if (this.memstoreSize.get() <= 0) {
>   return false;
> }
> {code}
> This is because we can see that the region is requested for a flush but the
> flush does not happen, and no flush-related logs are printed. So for some
> reason this memstoreSize has become 0 (I assume this). The earlier bugs were
> also due to a similar reason.
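> Something like this, perhaps (the LOG usage and message text here are only a
> sketch, not the actual HRegion code):
> {code}
> // Inside HRegion.internalFlushcache(), before the early return shown above:
> if (this.memstoreSize.get() <= 0) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Empty memstore size for the current region " + this);
>   }
>   return false;
> }
> {code}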