[jira] [Updated] (IGNITE-11974) infinite loop and 100% cpu in GridDhtPartitionsEvictor: Eviction in progress ...

Igor Kamyshnikov (JIRA) Wed, 10 Jul 2019 06:19:24 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-11974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Igor Kamyshnikov updated IGNITE-11974:
--------------------------------------
    Description: 
Note: RCA was not done:

Sometimes ignite server nodes fall into infinite loop and consume 100% cpu:
{noformat}
"sys-#260008" #260285 prio=5 os_prio=0 tid=0x00007fabb020a800 nid=0x1e850 
runnable [0x00007fab26fef000]
   java.lang.Thread.State: RUNNABLE
        at 
java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
        at 
java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:84)
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:73)
        at 
org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6695)
        at 
org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
        at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - <0x0000000649b9cba0> (a 
java.util.concurrent.ThreadPoolExecutor$Worker)
{noformat}

the following appears in logs each 2 minutes:
{noformat}
#### INFO  2019-07-08 12:21:45.081 (1562581305081) [sys-#98168] 
[GridDhtPartitionsEvictor] > Eviction in progress 
[grp=CUSTPRODINVOICEDISCUSAGE, remainingCnt=102]
{noformat}

remainingCnt remains the same once it reached 102 (the very first line in the 
logs was with value equal to 101).

Some other facts:
we have a heapdump taken for *topVer = 900* . the problem appeared after 
*topVer = 790*, but it looks like it was silently waiting from *topVer = 641* 
(about 24 hours back).
There were 259 topology changes between 900 and 641.

All 102 GridDhtLocalPartitions can be found in the heapdump:
{noformat}
select * from 
"org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition"
 t where delayedRenting = true
{noformat}

They all have status = 65537 , which means (according to 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition#state):

reservations(65537) = 1
getPartState(65537) = OWNING

There are also 26968 instances of 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl$$Lambda$70,
 that are created by 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl#checkEvictions
 method.
26418 of 26968 refer to AtomicInteger instance with value = 102:
26418/102 = 259 = 900 - 641 (see topology info above).

The key thing seen from the heapdump is that topVer = 641 or topVer = 642 was 
the last topology where these 102 partitions were assigned to the current 
ignite server node.

{noformat}
select
  t.this
 ,t.this['clientEvtChange'] as clientEvtChange
 ,t.this['topVer.topVer'] as topVer
 
,t.this['assignment.elementData'][555]['elementData'][0]['hostNames.elementData'][0]
 as primary_part
 
,t.this['assignment.elementData'][555]['elementData'][1]['hostNames.elementData'][0]
 as secondary_part
from org.apache.ignite.internal.processors.affinity.HistoryAffinityAssignment t 
where length(t.this['assignment.elementData']) = 1024
order by topVer
{noformat}

 !image-2019-07-10-16-07-37-185.png! 

The connection of a client node at topVer = 790 somehow triggered the 
GridDhtPartitionsEvictor loop to execute.

Summary:
1) it is seen that 102 partitions has one reservation and OWNING state.
2) they were backup partitions.
3) for some reason their eviction has been silently delaying (because of 
reservations), but each topology change seemed to trigger eviction attempt.
4) something managed to make 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor#evictPartitionAsync
 to run never exiting.

Additional info:
topVer = 641 was in chain of sever nodes restarts (not sure if rebalancing 
actually succeeded):
 !server-node-restarts-1.png! 

  was:
Note: RCA was not done:

Sometimes ignite server nodes fall into infinite loop and consume 100% cpu:
{noformat}
"sys-#260008" #260285 prio=5 os_prio=0 tid=0x00007fabb020a800 nid=0x1e850 
runnable [0x00007fab26fef000]
   java.lang.Thread.State: RUNNABLE
        at 
java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
        at 
java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:84)
        at 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:73)
        at 
org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6695)
        at 
org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
        at 
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
        - <0x0000000649b9cba0> (a 
java.util.concurrent.ThreadPoolExecutor$Worker)
{noformat}

the following appears in logs each 2 minutes:
{noformat}
#### INFO  2019-07-08 12:21:45.081 (1562581305081) [sys-#98168] 
[GridDhtPartitionsEvictor] > Eviction in progress 
[grp=CUSTPRODINVOICEDISCUSAGE, remainingCnt=102]
{noformat}

remainingCnt remains the same once it reached 102 (the very first line in the 
logs was with value equal to 101).

Some other facts:
we have a heapdump taken for *topVer = 900* . the problem appeared after 
*topVer = 790*, but it looks like it was silently waiting from *topVer = 641* 
(about 24 hours back).
There were 259 topology changes between 900 and 641.

All 102 GridDhtLocalPartitions can be found in the heapdump:
{noformat}
select * from 
"org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition"
 t where delayedRenting = true
{noformat}

They all have status = 65537 , which means (according to 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition#state):

reservations(65537) = 1
getPartState(65537) = OWNING

There are also 26968 instances of 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl$$Lambda$70,
 that are created by 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl#checkEvictions
 method.
26418 of 26968 refer to AtomicInteger instance with value = 102:
26418/102 = 259 = 900 - 641 (see topology info above).

The key thing seen from the heapdump is that topVer = 641 or topVer = 642 was 
the last topology where these 102 partitions were assigned to the current 
ignite server node.

{noformat}
select
  t.this
 ,t.this['clientEvtChange'] as clientEvtChange
 ,t.this['topVer.topVer'] as topVer
 
,t.this['assignment.elementData'][555]['elementData'][0]['hostNames.elementData'][0]
 as primary_part
 
,t.this['assignment.elementData'][555]['elementData'][1]['hostNames.elementData'][0]
 as secondary_part
from org.apache.ignite.internal.processors.affinity.HistoryAffinityAssignment t 
where length(t.this['assignment.elementData']) = 1024
order by topVer
{noformat}

 !image-2019-07-10-16-07-37-185.png! 

The connection of a client node at topVer = 790 somehow triggered the 
GridDhtPartitionsEvictor loop to execute.

Summary:
1) it is seen that 102 partitions has one reservation and OWNING state.
2) they were backup partitions.
3) for some reason their eviction has been silently delaying (because of 
reservations), but each topology change seemed to trigger eviction attempt.
4) something managed to make 
org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor#evictPartitionAsync
 to run never exiting.



> infinite loop and 100% cpu in GridDhtPartitionsEvictor: Eviction in progress 
> ...
> --------------------------------------------------------------------------------
>
>                 Key: IGNITE-11974
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11974
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.5
>            Reporter: Igor Kamyshnikov
>            Priority: Blocker
>         Attachments: image-2019-07-10-16-07-37-185.png, 
> server-node-restarts-1.png
>
>
> Note: RCA was not done:
> Sometimes ignite server nodes fall into infinite loop and consume 100% cpu:
> {noformat}
> "sys-#260008" #260285 prio=5 os_prio=0 tid=0x00007fabb020a800 nid=0x1e850 
> runnable [0x00007fab26fef000]
>    java.lang.Thread.State: RUNNABLE
>       at 
> java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
>       at 
> java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:84)
>       at 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:73)
>       at 
> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6695)
>       at 
> org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
>       at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>       - <0x0000000649b9cba0> (a 
> java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
> the following appears in logs each 2 minutes:
> {noformat}
> #### INFO  2019-07-08 12:21:45.081 (1562581305081) [sys-#98168] 
> [GridDhtPartitionsEvictor] > Eviction in progress 
> [grp=CUSTPRODINVOICEDISCUSAGE, remainingCnt=102]
> {noformat}
> remainingCnt remains the same once it reached 102 (the very first line in the 
> logs was with value equal to 101).
> Some other facts:
> we have a heapdump taken for *topVer = 900* . the problem appeared after 
> *topVer = 790*, but it looks like it was silently waiting from *topVer = 641* 
> (about 24 hours back).
> There were 259 topology changes between 900 and 641.
> All 102 GridDhtLocalPartitions can be found in the heapdump:
> {noformat}
> select * from 
> "org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition"
>  t where delayedRenting = true
> {noformat}
> They all have status = 65537 , which means (according to 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition#state):
> reservations(65537) = 1
> getPartState(65537) = OWNING
> There are also 26968 instances of 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl$$Lambda$70,
>  that are created by 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl#checkEvictions
>  method.
> 26418 of 26968 refer to AtomicInteger instance with value = 102:
> 26418/102 = 259 = 900 - 641 (see topology info above).
> The key thing seen from the heapdump is that topVer = 641 or topVer = 642 was 
> the last topology where these 102 partitions were assigned to the current 
> ignite server node.
> {noformat}
> select
>   t.this
>  ,t.this['clientEvtChange'] as clientEvtChange
>  ,t.this['topVer.topVer'] as topVer
>  
> ,t.this['assignment.elementData'][555]['elementData'][0]['hostNames.elementData'][0]
>  as primary_part
>  
> ,t.this['assignment.elementData'][555]['elementData'][1]['hostNames.elementData'][0]
>  as secondary_part
> from org.apache.ignite.internal.processors.affinity.HistoryAffinityAssignment 
> t where length(t.this['assignment.elementData']) = 1024
> order by topVer
> {noformat}
>  !image-2019-07-10-16-07-37-185.png! 
> The connection of a client node at topVer = 790 somehow triggered the 
> GridDhtPartitionsEvictor loop to execute.
> Summary:
> 1) it is seen that 102 partitions has one reservation and OWNING state.
> 2) they were backup partitions.
> 3) for some reason their eviction has been silently delaying (because of 
> reservations), but each topology change seemed to trigger eviction attempt.
> 4) something managed to make 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor#evictPartitionAsync
>  to run never exiting.
> Additional info:
> topVer = 641 was in chain of sever nodes restarts (not sure if rebalancing 
> actually succeeded):
>  !server-node-restarts-1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (IGNITE-11974) infinite loop and 100% cpu in GridDhtPartitionsEvictor: Eviction in progress ...

Reply via email to