[
https://issues.apache.org/jira/browse/HBASE-19320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16268142#comment-16268142
]
Ashish Singhi edited comment on HBASE-19320 at 11/28/17 5:30 AM:
-----------------------------------------------------------------
We have faced this issue a couple of times in our clusters with replication, where
replication was getting stuck because of this (the base version of HBase was
1.0.2).
Possible cause:
When there are heavy writes to the source cluster, RegionServers in the peer
cluster may be unable to receive replicated data for a long period of time,
whether due to a RegionServer shutdown, DataNode faults, or network issues. When
the RegionServer functionality is restored, it may receive a replication request
containing a very large number of edits. Consequently, RegionServers in the peer
cluster take longer than hbase.rpc.replication.timeout to process the
replication request. Once the time spent processing the request exceeds that
limit, the RegionServer in the source cluster resends the request, raising the
possibility that the direct buffer memory (default size 64 MB) fills up until it
can no longer serve any further requests.
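When diagnosing this, the JVM's own direct buffer pool statistics can show how close the process is to its -XX:MaxDirectMemorySize limit. A minimal diagnostic sketch (not part of HBase; the class name and the 8 MB allocation below are purely illustrative of a large replication payload):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class DirectMemoryCheck {
    // Returns the bytes of direct buffer memory currently reserved by the
    // JVM, as reported by the "direct" BufferPoolMXBean (-1 if not found).
    static long directMemoryUsed() {
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // Simulate one oversized request buffer: an 8 MB direct allocation.
        // It stays reserved until GC reclaims the ByteBuffer, so repeated
        // oversized requests can exhaust the direct memory limit.
        ByteBuffer big = ByteBuffer.allocateDirect(8 * 1024 * 1024);
        long after = directMemoryUsed();
        System.out.printf("direct memory before=%d after=%d (delta=%d)%n",
            before, after, after - before);
    }
}
```

With a 64 MB limit, only a handful of such buffers held across threads is enough to trigger OutOfMemoryError: Direct buffer memory, which matches the thread-local buffer caching behavior described in the evanjones.ca article linked in this issue.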
Logs in the source cluster RS looked something like this:
{noformat}
2016-05-26 10:12:00,367 | WARN | regionserver/XX/XX:21302.replicationSource,33
| Can't replicate because of an error on the remote cluster: |
org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:275)
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.RpcServer$CallQueueTooBigException):
Call queue is full on /XX:21302, is hbase.ipc.server.max.callqueue.size too
small?
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1277)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
at
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:324)
at
org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:25690)
at
org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:79)
at
org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:381)
at
org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint$Replicator.call(HBaseInterClusterReplicationEndpoint.java:364)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}
And in the peer cluster RS, like this:
{noformat}
2016-05-26 10:12:21,971 | INFO | pool-5614-thread-1 | #2, waiting for 7232
actions to finish |
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.waitUntilDone(AsyncProcess.java:1572)
2016-05-26 10:12:31,990 | INFO | pool-5614-thread-1 | #2, waiting for 7232
actions to finish |
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.waitUntilDone(AsyncProcess.java:1572)
2016-05-26 10:12:42,006 | INFO | pool-5614-thread-1 | #2, waiting for 7232
actions to finish |
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.waitUntilDone(AsyncProcess.java:1572)
2016-05-26 10:12:52,029 | INFO | pool-5614-thread-1 | #2, waiting for 7232
actions to finish |
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.waitUntilDone(AsyncProcess.java:1572)
{noformat}
The procedure we used to avoid this issue was:
Add -XX:MaxDirectMemorySize=1024m to HBASE_OPTS in the
$HBASE_CONF_DIR/hbase-env.sh of each RegionServer process in the peer cluster,
then restart all of those processes.
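For illustration, the change amounts to appending the flag to HBASE_OPTS (a sketch; the 1024m value worked for our workload and may need adjusting for yours):

```shell
# In $HBASE_CONF_DIR/hbase-env.sh on each peer-cluster RegionServer,
# then rolling-restart the RegionServer processes.
export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=1024m"
```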
After increasing MaxDirectMemorySize we have not faced this issue again; it's
been more than a year now.
> document the mysterious direct memory leak in hbase
> ----------------------------------------------------
>
> Key: HBASE-19320
> URL: https://issues.apache.org/jira/browse/HBASE-19320
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.0.0, 1.2.6
> Reporter: huaxiang sun
> Assignee: huaxiang sun
> Attachments: Screen Shot 2017-11-21 at 4.43.36 PM.png, Screen Shot
> 2017-11-21 at 4.44.22 PM.png
>
>
> Recently we run into a direct memory leak case, which takes some time to
> trace and debug. Internally discussed with our [[email protected]], we
> thought we had some findings and want to share with the community.
> Basically, it is the issue described in
> http://www.evanjones.ca/java-bytebuffer-leak.html and it happened to one of
> our hbase clusters.
> Create the jira first and will fill in more details later.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)