[ https://issues.apache.org/jira/browse/HBASE-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13916482#comment-13916482 ]
Nick Dimiduk commented on HBASE-10643:
--------------------------------------
The RS initializes its connection to ZooKeeper very early in the startup
process. The BucketCache isn't created until after assignments have been
received and the first storefile is opened. Thus, the pause caused by
allocating direct memory comes after the ZK session is established.
One option would be to add a configuration point that, when enabled, asks
the RS to initialize its cache structures before the ZK session is
established. That way the pause won't disrupt the session. Other ideas?
Note that this should not be an issue with the BucketCache running in file
mode.
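The comment above proposes an opt-in startup order. Below is a minimal sketch of that order, assuming a simple boolean switch. This is not HBase code. The class names, the `initCacheBeforeSession` flag, and the "session"/"assignments" steps are hypothetical stand-ins that model only the ordering. The only real API used is the JDK's `ByteBuffer.allocateDirect`.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not HBase code: models only the ORDER of two startup
// steps, allocating an off-heap cache and opening a coordination (ZK) session.
public class CacheFirstStartup {

    /** Stand-in for an off-heap cache carved into fixed-size direct buffers. */
    static final class OffHeapCache {
        final List<ByteBuffer> buckets = new ArrayList<>();

        void allocate(long totalBytes, int bucketBytes) {
            for (long done = 0; done < totalBytes; done += bucketBytes) {
                int size = (int) Math.min(bucketBytes, totalBytes - done);
                // allocateDirect reserves memory outside the Java heap; the total
                // is capped by -XX:MaxDirectMemorySize.
                buckets.add(ByteBuffer.allocateDirect(size));
            }
        }
    }

    final List<String> events = new ArrayList<>();
    final OffHeapCache cache = new OffHeapCache();

    /** @param initCacheBeforeSession the proposed opt-in switch (hypothetical name) */
    void start(boolean initCacheBeforeSession, long cacheBytes, int bucketBytes) {
        if (initCacheBeforeSession) {
            // Proposed order: any allocation pause happens before a session
            // exists, so there is no session timeout to miss.
            cache.allocate(cacheBytes, bucketBytes);
            events.add("cache");
        }
        events.add("session");     // stand-in for establishing the ZK session
        events.add("assignments"); // stand-in for receiving region assignments
        if (!initCacheBeforeSession) {
            // Current order described in the comment: the cache is created when
            // the first store file is opened, after the session is live.
            cache.allocate(cacheBytes, bucketBytes);
            events.add("cache");
        }
    }

    public static void main(String[] args) {
        long cacheBytes = 4L * 1024 * 1024; // tiny, for illustration only
        int bucketBytes = 1024 * 1024;

        CacheFirstStartup proposed = new CacheFirstStartup();
        proposed.start(true, cacheBytes, bucketBytes);
        System.out.println(proposed.events); // [cache, session, assignments]

        CacheFirstStartup current = new CacheFirstStartup();
        current.start(false, cacheBytes, bucketBytes);
        System.out.println(current.events);  // [session, assignments, cache]
    }
}
```

The sketch uses a switch rather than always allocating first, which keeps the existing lazy behaviour as the default. The trade-off is that the RS takes longer to register with ZK when the switch is on.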
> Failure in RS when using large size bucketcache
> -----------------------------------------------
>
> Key: HBASE-10643
> URL: https://issues.apache.org/jira/browse/HBASE-10643
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.98.0, 0.96.0
> Reporter: Biju Nair
> Labels: bucketCache, regionserver
>
> When the RS is brought up with -XX:MaxDirectMemorySize of 22GB or higher, the
> RS fails after a successful start. The RS logs suggest that the BucketCache
> memory allocation takes long enough that ZK considers the RS dead. One
> option to fix the problem would be to allocate the BucketCache before
> registering with ZK.
> 2014-02-28 18:54:42,967 WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 33496ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2014-02-28 18:54:42,967 WARN [regionserver60020.periodicFlusher] util.Sleeper: We slept 33496ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2014-02-28 18:54:42,967 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 23988ms
> GC pool 'ParNew' had collection(s): count=1 time=24432ms
> 2014-02-28 18:54:43,006 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server bbg-master2.bbg-test.hdp,60020,1393628951236: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing bbg-master2.bbg-test.hdp,60020,1393628951236 as dead server
>     at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:341)
>     at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:254)
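The "We slept 33496ms instead of 10000ms" warnings above come from a sleep-and-measure check. A thread sleeps for a fixed interval, then compares the wall-clock time that actually passed with the interval it requested. Below is a minimal sketch of that general technique. It is not the `util.Sleeper` or `JvmPauseMonitor` source. The class name is hypothetical, and so are the 10000 ms interval and threshold, which were picked to mirror the log rather than taken from HBase.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of sleep-based pause detection; not HBase/Hadoop source.
public class PauseDetector {
    private final long intervalMs;      // how long each sleep is supposed to last
    private final long warnThresholdMs; // extra delay above which a pause is reported
    final List<Long> detectedPausesMs = new ArrayList<>();

    PauseDetector(long intervalMs, long warnThresholdMs) {
        this.intervalMs = intervalMs;
        this.warnThresholdMs = warnThresholdMs;
    }

    /** Time the thread was stalled beyond the requested sleep (never negative). */
    static long extraDelay(long intervalMs, long sleptMs) {
        return Math.max(0L, sleptMs - intervalMs);
    }

    /** Pure check, so the rule can be exercised without really sleeping. */
    boolean isPause(long sleptMs) {
        return extraDelay(intervalMs, sleptMs) > warnThresholdMs;
    }

    /** Sleep once, measure elapsed time with a monotonic clock, report stalls. */
    void runOnce() throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(intervalMs);
        long sleptMs = (System.nanoTime() - start) / 1_000_000L;
        if (isPause(sleptMs)) {
            detectedPausesMs.add(extraDelay(intervalMs, sleptMs));
            System.err.printf("We slept %dms instead of %dms%n", sleptMs, intervalMs);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PauseDetector d = new PauseDetector(10_000L, 10_000L);
        // Values from the log: 33496 ms observed vs 10000 ms requested.
        System.out.println(extraDelay(10_000L, 33_496L)); // 23496
        System.out.println(d.isPause(33_496L));           // true
        System.out.println(d.isPause(10_050L));           // false

        // A real (short) sleep; on an idle JVM this normally reports nothing.
        new PauseDetector(100L, 1_000L).runOnce();
    }
}
```

Because the check only sees elapsed time, it cannot distinguish a GC pause from any other stall of the JVM or host. That is why the log message says "JVM or host machine".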