Re: [Dev] [BRS] Error in CacheCleanupTask in BRS clustered setup when one node goes OOM

Milinda Perera Thu, 19 Nov 2015 03:25:41 -0800

Hi Azeez,

Actually, we analyzed the OOM issue and found that it originates on the BRS
side [1]. For now we have fixed it for stateless rule sessions, but there
are limitations for stateful rule sessions.


When BRS receives a request, it creates a knowledge session from the
KnowledgeBase, feeds in the facts, and executes the rules.

In Drools there are two types of knowledge sessions:
(1) Stateful [2]
    Stateful knowledge sessions are long-lived, and the knowledge base
keeps a reference to them until they are disposed. In BRS we have bound
stateful sessions to Axis2 sessions.

(2) Stateless [3]
    A stateless knowledge session is similar to a function: feed in facts,
execute rules, and receive results. It simply wraps a stateful knowledge
session and disposes of it at the end of execution.

Since the knowledge base [4] keeps references to stateful knowledge
sessions, the memory it retains keeps growing and the node eventually goes
OOM if we don't dispose of them.
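
The retention pattern described above can be illustrated with a toy model.
This is only a sketch: ToyKnowledgeBase and ToyStatefulSession are
hypothetical stand-ins written for this mail, not the Drools API, and the
"rule execution" is just a fact count. It assumes single-threaded use.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: these classes are hypothetical and are NOT the Drools API.
class ToyKnowledgeBase {
    // The base holds a strong reference to every session that is not yet disposed.
    private final Set<ToyStatefulSession> liveSessions = new HashSet<>();

    ToyStatefulSession newStatefulSession() {
        ToyStatefulSession session = new ToyStatefulSession(this);
        liveSessions.add(session);
        return session;
    }

    void release(ToyStatefulSession session) {
        liveSessions.remove(session);
    }

    int liveSessionCount() {
        return liveSessions.size();
    }

    // Stateless-style execution: wrap a stateful session and always dispose it.
    int executeStateless(List<Object> facts) {
        ToyStatefulSession session = newStatefulSession();
        try {
            for (Object fact : facts) {
                session.insert(fact);
            }
            return session.fireAllRules();
        } finally {
            session.dispose();
        }
    }
}

class ToyStatefulSession {
    private final ToyKnowledgeBase base;
    private final List<Object> facts = new ArrayList<>();

    ToyStatefulSession(ToyKnowledgeBase base) {
        this.base = base;
    }

    void insert(Object fact) {
        facts.add(fact);
    }

    // Stand-in for rule execution: reports how many facts were inserted.
    int fireAllRules() {
        return facts.size();
    }

    // Drops the facts and removes the base's reference to this session.
    void dispose() {
        facts.clear();
        base.release(this);
    }
}
```

In this model, every call to newStatefulSession() that is not followed by
dispose() leaves one more session reachable from the base, which is the
growth we see under load. The stateless path never accumulates anything
because the dispose happens in the finally block.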

Since the stateful rule service uses the Axis2 transport session scope, we
used the Axis2 Lifecycle class to dispose of the created session object.
However, it seems the underlying Axis2 implementation does not call the
relevant Lifecycle interface dispose method when the session expires. It is
only called when the service context gets garbage collected. This is the
cause of the OOM issue.
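
One possible direction is to dispose on idle timeout ourselves instead of
relying on the container callback. The sketch below is only an illustration
of that idea, not what BRS does today: ExpiringSessionRegistry and its
Disposable interface are hypothetical names, and the caller is assumed to
pass in the current time and to invoke sweep() periodically.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: this class is not part of Axis2, Drools, or BRS.
class ExpiringSessionRegistry {
    // Anything that must be released explicitly, e.g. a stateful rule session.
    interface Disposable {
        void dispose();
    }

    private static final class Tracked {
        final Disposable session;
        volatile long lastAccessMillis;

        Tracked(Disposable session, long nowMillis) {
            this.session = session;
            this.lastAccessMillis = nowMillis;
        }
    }

    private final Map<String, Tracked> tracked = new ConcurrentHashMap<>();
    private final long idleTimeoutMillis;

    ExpiringSessionRegistry(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    void register(String sessionId, Disposable session, long nowMillis) {
        tracked.put(sessionId, new Tracked(session, nowMillis));
    }

    // Records activity so that the session is not treated as idle.
    void touch(String sessionId, long nowMillis) {
        Tracked t = tracked.get(sessionId);
        if (t != null) {
            t.lastAccessMillis = nowMillis;
        }
    }

    // Disposes every session idle for longer than the timeout.
    // Returns the number of sessions disposed in this pass.
    int sweep(long nowMillis) {
        int disposed = 0;
        Iterator<Map.Entry<String, Tracked>> it = tracked.entrySet().iterator();
        while (it.hasNext()) {
            Tracked t = it.next().getValue();
            if (nowMillis - t.lastAccessMillis > idleTimeoutMillis) {
                it.remove();
                t.session.dispose();
                disposed++;
            }
        }
        return disposed;
    }

    int size() {
        return tracked.size();
    }
}
```

The time is passed in as a parameter so that the expiry logic can be
exercised without sleeping. In a real service, sweep() would probably be
driven by a scheduled task with a timeout matching the Axis2 session
timeout, but that wiring is left out here.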

We tried this scenario to reproduce the JIRA issue [5]. However, we see
that if a node in the cluster crashes for some reason, one of the other
nodes starts throwing the Hazelcast exception mentioned earlier in this
thread from the cache cleanup task.
What could be the reason?

[1] mail: [BRS] Solution for OOM issue due to misuse of drools engine in
BRS 220
[2]
https://docs.jboss.org/jbpm/v5.1/javadocs/org/drools/runtime/StatefulKnowledgeSession.html
[3]
https://docs.jboss.org/jbpm/v5.1/javadocs/org/drools/runtime/StatelessKnowledgeSession.html
[4] https://docs.jboss.org/jbpm/v5.1/javadocs/org/drools/KnowledgeBase.html
[5] https://wso2.org/jira/browse/BRS-100


Thanks,
Milinda

On Thu, Nov 19, 2015 at 10:22 AM, Afkham Azeez <[email protected]> wrote:

> Just saying that a node went OOM has no value. Also seeing a Hazelcast error
> in the stacktrace doesn't necessarily mean Hazelcast caused your node to go
> OOM. You have to profile and see why it is going OOM.
>
> On Thu, Nov 19, 2015 at 10:18 AM, Milinda Perera <[email protected]>
> wrote:
>
>> Hi,
>>
>> In the BRS 220 snapshot (with the kernel upgraded to 442), in a clustered
>> setup (3 nodes in our test), we ran a load test against a stateful rule
>> service on one node (let's say node3) until it went OOM. The following
>> errors were shown on two of the nodes:
>>
>>
>> *CacheCleanup error shown on one of the nodes that is working fine
>> (in our case node2):*
>> [2015-11-17 17:16:58,951]  WARN {org.wso2.carbon.caching.impl.CacheImpl}
>> -  Exception occurred while expiring item from distributed cache. No
>> response for 120000 ms. Aborting invocation! Invocation{
>> serviceName='hz:impl:mapService',
>> op=RemoveOperation{$cache.$domain[carbon.super]Claim.Cache.Manager#Claim.Cache},
>> partitionId=64, replicaIndex=0, tryCount=250, tryPauseMillis=500,
>> invokeCount=1, callTimeout=60000, target=Address[10.100.5.92]:4002,
>> backupsExpected=0, backupsCompleted=0} No response has been received!
>> backups-expected:0 backups-completed: 0
>> [2015-11-17 17:21:08,971] ERROR
>> {org.wso2.carbon.caching.impl.CacheCleanupTask} -  Error occurred while
>> running CacheCleanupTask
>> com.hazelcast.core.OperationTimeoutException: No response for 120000 ms.
>> Aborting invocation! Invocation{ serviceName='hz:impl:mapService',
>> op=ClearOperation{}, partitionId=46, replicaIndex=0, tryCount=250,
>> tryPauseMillis=500, invokeCount=1, callTimeout=60000,
>> target=Address[10.100.5.92]:4002, backupsExpected=0, backupsCompleted=0} No
>> response has been received!  backups-expected:0 backups-completed: 0
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:277)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.retryFailedPartitions(InvokeOnPartitions.java:131)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:67)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:326)
>>     at
>> com.hazelcast.map.impl.proxy.MapProxySupport.clearInternal(MapProxySupport.java:914)
>>     at
>> com.hazelcast.map.impl.proxy.MapProxyImpl.clearInternal(MapProxyImpl.java:71)
>>     at
>> com.hazelcast.map.impl.proxy.MapProxyImpl.clear(MapProxyImpl.java:532)
>>     at
>> org.wso2.carbon.core.clustering.hazelcast.HazelcastDistributedMapProvider$DistMap.clear(HazelcastDistributedMapProvider.java:172)
>>     at org.wso2.carbon.caching.impl.CacheImpl.stop(CacheImpl.java:734)
>>     at
>> org.wso2.carbon.caching.impl.CarbonCacheManager.removeCache(CarbonCacheManager.java:168)
>>     at org.wso2.carbon.caching.impl.CacheImpl.expire(CacheImpl.java:769)
>>     at
>> org.wso2.carbon.caching.impl.CacheImpl.runCacheExpiry(CacheImpl.java:931)
>>     at
>> org.wso2.carbon.caching.impl.CacheCleanupTask.run(CacheCleanupTask.java:61)
>>     at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>     at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>     at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>     at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>>     at ------ End remote and begin local stack-trace ------.(Unknown
>> Source)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.retryFailedPartitions(InvokeOnPartitions.java:131)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.InvokeOnPartitions.invoke(InvokeOnPartitions.java:67)
>>     at
>> com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.invokeOnAllPartitions(OperationServiceImpl.java:326)
>>     at
>> com.hazelcast.map.impl.proxy.MapProxySupport.clearInternal(MapProxySupport.java:914)
>>     at
>> com.hazelcast.map.impl.proxy.MapProxyImpl.clearInternal(MapProxyImpl.java:71)
>>     at
>> com.hazelcast.map.impl.proxy.MapProxyImpl.clear(MapProxyImpl.java:532)
>>     at
>> org.wso2.carbon.core.clustering.hazelcast.HazelcastDistributedMapProvider$DistMap.clear(HazelcastDistributedMapProvider.java:172)
>>     at org.wso2.carbon.caching.impl.CacheImpl.stop(CacheImpl.java:734)
>>     at
>> org.wso2.carbon.caching.impl.CarbonCacheManager.removeCache(CarbonCacheManager.java:168)
>>     at org.wso2.carbon.caching.impl.CacheImpl.expire(CacheImpl.java:769)
>>     at
>> org.wso2.carbon.caching.impl.CacheImpl.runCacheExpiry(CacheImpl.java:931)
>>     at
>> org.wso2.carbon.caching.impl.CacheCleanupTask.run(CacheCleanupTask.java:61)
>>     at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>     at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>     at
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>     at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:745)
>> [2015-11-17 17:23:49,753]  WARN {org.wso2.carbon.caching.impl.CacheImpl}
>> -  Exception occurred while expiring item from distributed cache. No
>> response for 120000 ms. Aborting invocation! Invocation{
>> serviceName='hz:impl:mapService',
>> op=RemoveOperation{$cache.$domain[carbon.super]registryCacheManager#REG_PATH_CACHE},
>> partitionId=46, replicaIndex=0, tryCount=250, tryPauseMillis=500,
>> invokeCount=1, callTimeout=60000, target=Address[10.100.5.92]:4002,
>> backupsExpected=0, backupsCompleted=0} No response has been received!
>> backups-expected:0 backups-completed: 0
>> [2015-11-17 17:26:02,701]  WARN {org.wso2.carbon.caching.impl.CacheImpl}
>> -  Exception occurred while expiring item from distributed cache.
>> com.hazelcast.spi.exception.RetryableIOException: Packet not send to ->
>> Address[10.100.5.92]:4002
>> [2015-11-17 17:28:35,259]  WARN {org.wso2.carbon.caching.impl.CacheImpl}
>> -  Exception occurred while expiring item from distributed cache.
>> com.hazelcast.spi.exception.RetryableIOException: Packet not send to ->
>> Address[10.100.5.92]:4002
>> [2015-11-17 17:30:37,821]  WARN {org.wso2.carbon.caching.impl.CacheImpl}
>> -  Exception occurred while expiring item from distributed cache.
>> com.hazelcast.spi.exception.RetryableIOException: Packet not send to ->
>> Address[10.100.5.92]:4002
>>
>>
>> *And the following error messages on the node that goes OOM (node3):*
>>
>> java.lang.OutOfMemoryError: Java heap space[2015-11-17 17:43:07,427]
>> ERROR {org.apache.tomcat.util.net.NioEndpoint$SocketProcessor} -
>> java.lang.OutOfMemoryError: Java heap space
>>
>> java.lang.OutOfMemoryError: Java heap space[2015-11-17 17:43:19,246]
>> ERROR
>> {com.hazelcast.spi.impl.operationexecutor.classic.ClassicOperationExecutor}
>> -  [10.100.5.92]:4002 [wso2.carbon.domain] [3.5.2] Failed to process
>> packet: Packet{header=1, isResponse=false, isOperation=true, isEvent=false,
>> partitionId=90, conn=Connection [0.0.0.0/0.0.0.0:4002 -> null],
>> endpoint=Address[10.100.5.92]:4001, live=false, type=MEMBER} on
>> hz.wso2.carbon.domain.instance.partition-operation.thread-2
>> java.lang.OutOfMemoryError: Java heap space
>>
>> java.lang.OutOfMemoryError: Java heap space[2015-11-17 17:43:22,764]
>> ERROR
>> {com.hazelcast.spi.impl.operationexecutor.classic.ClassicOperationExecutor}
>> -  [10.100.5.92]:4002 [wso2.carbon.domain] [3.5.2] Failed to process
>> packet: Packet{header=1, isResponse=false, isOperation=true, isEvent=false,
>> partitionId=110, conn=Connection [0.0.0.0/0.0.0.0:4002 -> null],
>> endpoint=Address[10.100.5.92]:4001, live=false, type=MEMBER} on
>> hz.wso2.carbon.domain.instance.partition-operation.thread-6
>> java.lang.OutOfMemoryError: Java heap space
>>
>> FYI: The other two nodes are working fine and serving requests.
>>
>> What could be the reason?
>>
>> Thanks,
>> Milinda
>>
>> --
>> Milinda Perera
>> Software Engineer;
>> WSO2 Inc. http://wso2.com ,
>> Mobile: (+94) 714 115 032
>>
>>
>
>
> --
> *Afkham Azeez*
> Director of Architecture; WSO2, Inc.; http://wso2.com
> Member; Apache Software Foundation; http://www.apache.org/
> * <http://www.apache.org/>*
> *email: **[email protected]* <[email protected]>
> *cell:* +94 77 3320919
> *blog:* http://blog.afkham.org
> *twitter: **http://twitter.com/afkham_azeez*
> <http://twitter.com/afkham_azeez>
> *linked-in: **http://lk.linkedin.com/in/afkhamazeez
> <http://lk.linkedin.com/in/afkhamazeez>*
>
> *Lean . Enterprise . Middleware*
>



-- 
Milinda Perera
Software Engineer;
WSO2 Inc. http://wso2.com ,
Mobile: (+94) 714 115 032

_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
