[
https://issues.apache.org/jira/browse/HBASE-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268142#comment-16268142
]
Ashish Singhi edited comment on HBASE-19320 at 11/28/17 5:30 AM:
-----------------------------------------------------------------
We have faced this issue a couple of times in our clusters with replication, where
replication was getting stuck because of this (the base version of HBase was
1.0.2).
Possible cause:
When there are heavy writes to the source cluster, RegionServers in the peer
cluster may be unable to receive replicated data for a long period of time,
whether due to a RegionServer shutdown, DataNode faults, or network issues. When
the RegionServer functionality is restored, it may receive a replication request
containing a very large number of edits. Consequently, RegionServers in the peer
cluster take longer than hbase.rpc.replication.timeout to process the
replication request. Once the time spent processing the request exceeds that
limit, the RegionServer in the source cluster resends the request, raising the
possibility that the direct buffer memory (default size 64 MB) fills up until it
can no longer serve any further requests.
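When diagnosing this, the JVM's own direct buffer pool statistics can show how close the process is to its -XX:MaxDirectMemorySize limit. A minimal diagnostic sketch (not part of HBase; the class name and the 8 MB allocation below are purely illustrative of a large replication payload):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class DirectMemoryCheck {
    // Returns the bytes of direct buffer memory currently reserved by the
    // JVM, as reported by the "direct" BufferPoolMXBean (-1 if not found).
    static long directMemoryUsed() {
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // Simulate one oversized request buffer: an 8 MB direct allocation.
        // It stays reserved until GC reclaims the ByteBuffer, so repeated
        // oversized requests can exhaust the direct memory limit.
        ByteBuffer big = ByteBuffer.allocateDirect(8 * 1024 * 1024);
        long after = directMemoryUsed();
        System.out.printf("direct memory before=%d after=%d (delta=%d)%n",
            before, after, after - before);
    }
}
```

With a 64 MB limit, only a handful of such buffers held across threads is enough to trigger OutOfMemoryError: Direct buffer memory, which matches the thread-local buffer caching behavior described in the evanjones.ca article linked in this issue.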
Logs in the source cluster RS looked something like this:
{noformat}
2016-05-26 10:12:00,367 | WARN | regionserver/XX/XX:21302.replicationSource,33
| Can't replicate because of an error on the remote cluster: |
org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:275)
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.RpcServer$CallQueueTooBigException):
Call queue is full on /XX:21302, is hbase.ipc.server.max.callqueue.size too
small?
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1277)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:324)
at
org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:25690)
at
org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:79)
at
org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:381)
at
org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:364)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}
And in the peer cluster RS, like this:
{noformat}
2016-05-26 10:12:21,971 | INFO | pool-5614-thread-1 | #2, waiting for 7232
actions to finish |
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.waitUntilDone(AsyncProcess.java:1572)
2016-05-26 10:12:31,990 | INFO | pool-5614-thread-1 | #2, waiting for 7232
actions to finish |
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.waitUntilDone(AsyncProcess.java:1572)
2016-05-26 10:12:42,006 | INFO | pool-5614-thread-1 | #2, waiting for 7232
actions to finish |
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.waitUntilDone(AsyncProcess.java:1572)
2016-05-26 10:12:52,029 | INFO | pool-5614-thread-1 | #2, waiting for 7232
actions to finish |
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.waitUntilDone(AsyncProcess.java:1572)
{noformat}
The procedure we used to avoid this issue was:
Add -XX:MaxDirectMemorySize=1024m to HBASE_OPTS in the
$HBASE_CONF_DIR/hbase-env.sh of each RegionServer process in the peer cluster,
then restart all of those processes.
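For illustration, the change amounts to appending the flag to HBASE_OPTS (a sketch; the 1024m value worked for our workload and may need adjusting for yours):

```shell
# In $HBASE_CONF_DIR/hbase-env.sh on each peer-cluster RegionServer,
# then rolling-restart the RegionServer processes.
export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=1024m"
```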
After increasing MaxDirectMemorySize we have not faced this issue again; it's
been more than a year now.
> document the mysterious direct memory leak in hbase
> ----------------------------------------------------
>
> Key: HBASE-19320
> URL: https://issues.apache.org/jira/browse/HBASE-19320
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.0.0, 1.2.6
> Reporter: huaxiang sun
> Assignee: huaxiang sun
> Attachments: Screen Shot 2017-11-21 at 4.43.36 PM.png, Screen Shot
> 2017-11-21 at 4.44.22 PM.png
>
>
> Recently we run into a direct memory leak case, which takes some time to
> trace and debug. Internally discussed with our [[email protected]], we
> thought we had some findings and want to share with the community.
> Basically, it is the issue described in
> http://www.evanjones.ca/java-bytebuffer-leak.html and it happened to one of
> our hbase clusters.
> Create the jira first and will fill in more details later.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)