[jira] [Updated] (IGNITE-14248) Handle exceptions in PartitionReservationManager.onDoneAfterTopologyUnlock properly

Vladimir Pligin (Jira) Fri, 26 Feb 2021 03:43:07 -0800


     [ https://issues.apache.org/jira/browse/IGNITE-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Vladimir Pligin updated IGNITE-14248:
-------------------------------------
    Description: 
If an exception (or even an Error) is thrown inside this method, the node ends 
up in an unrecoverable state. Here's an example.
 # an exchange is about to finish, so it's time to invalidate partition 
reservations
 # the exchange thread delegates this to a thread in the management pool
 # the management pool tries to allocate a new thread (the pool may have been 
idle and therefore have no live threads)
 # the thread cannot be created, for example because ulimit is reached; the 
error is java.lang.OutOfMemoryError: unable to create native thread: possibly 
out of memory or process/resource limits reached
 # the error is logged and no further action is taken
 # the partitions stay reserved forever
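
The steps above can be reproduced in isolation with a minimal sketch. Everything in it is a hypothetical stand-in rather than Ignite code: DemoReservation, DemoReservationManager and the "exhausted" executor are invented for illustration, and the OutOfMemoryError is thrown by hand to simulate a pool that cannot start a worker thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

/** Hypothetical stand-in for a partition reservation (not an Ignite class). */
final class DemoReservation {
    private volatile boolean invalidated;

    void invalidate() { invalidated = true; }

    boolean isInvalidated() { return invalidated; }
}

/** Hypothetical manager that mirrors the dispatch-and-log pattern from the report. */
final class DemoReservationManager {
    final Map<String, DemoReservation> reservations = new ConcurrentHashMap<>();

    /** Last throwable that was "logged"; kept only so the demo can inspect it. */
    volatile Throwable lastLogged;

    void onExchangeDone(Executor pool) {
        try {
            // Cleanup is handed off to the pool instead of running on the caller's thread.
            pool.execute(() -> reservations.values().forEach(DemoReservation::invalidate));
        }
        catch (Throwable e) {
            // Logged and swallowed: nothing retries the cleanup or escalates the failure.
            lastLogged = e;
        }
    }
}

final class StuckReservationDemo {
    public static void main(String[] args) {
        DemoReservationManager mgr = new DemoReservationManager();
        DemoReservation res = new DemoReservation();
        mgr.reservations.put("part-0", res);

        // Simulated pool that cannot start a worker thread (assumption for the demo).
        Executor exhausted = task -> {
            throw new OutOfMemoryError("unable to create native thread (simulated)");
        };

        mgr.onExchangeDone(exhausted);

        System.out.println("logged: " + mgr.lastLogged);
        System.out.println("invalidated: " + res.isInvalidated());
    }
}
```

In this sketch the error is recorded but the reservation is never invalidated, which is the "reserved forever" outcome described in the last step.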

Message:

 
{code:java}
2021-02-25 05:52:03.242 [exchange-worker-#182] ERROR o.a.i.i.p.q.h.t.PartitionReservationManager - Unexpected exception on start reservations cleanup
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.base/java.lang.Thread.start0(Native Method)
        at java.base/java.lang.Thread.start(Thread.java:803)
        at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
        at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
        at org.apache.ignite.internal.processors.closure.GridClosureProcessor.runLocal(GridClosureProcessor.java:847)
        at org.apache.ignite.internal.processors.query.h2.twostep.PartitionReservationManager.onDoneAfterTopologyUnlock(PartitionReservationManager.java:323)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:2617)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:159)
        at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:475)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1064)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3375)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3194)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
        at java.base/java.lang.Thread.run(Thread.java:834)
{code}
 

 

Code of PartitionReservationManager.onDoneAfterTopologyUnlock:
{code:java}
@Override public void onDoneAfterTopologyUnlock(final GridDhtPartitionsExchangeFuture fut) {
    try {
        // Must not do anything at the exchange thread. Dispatch to the management thread pool.
        ctx.closure().runLocal(() -> {
                AffinityTopologyVersion topVer = ctx.cache().context().exchange()
                    .lastAffinityChangedTopologyVersion(fut.topologyVersion());

                reservations.forEach((key, r) -> {
                    if (r != REPLICATED_RESERVABLE && !F.eq(key.topologyVersion(), topVer)) {
                        assert r instanceof GridDhtPartitionsReservation;

                        ((GridDhtPartitionsReservation)r).invalidate();
                    }
                });
            },
            GridIoPolicy.MANAGEMENT_POOL);
    }
    catch (Throwable e) {
        log.error("Unexpected exception on start reservations cleanup", e);
    }
}
{code}
 

 

As I see it, there are two basic approaches:
 * kill the node (it's already non-functional at this point), which seems to 
be a job for the failure handler (FH)
 * try to recover somehow (to be honest, it's not clear how exactly)

This particular OOM situation does seem unrecoverable: it's an environment 
misconfiguration. It would be great to investigate whether potentially 
recoverable exceptions can be raised inside this block. 
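
As a rough illustration of the first approach, here is a minimal sketch of a dispatcher that escalates instead of only logging. It is not the actual fix and does not use Ignite's API: EscalatingDispatcher and the onCriticalFailure callback are hypothetical names, with the callback standing in for whatever failure handler would stop the node.

```java
import java.util.concurrent.Executor;
import java.util.function.Consumer;

/** Hypothetical dispatcher; the callback stands in for a node failure handler. */
final class EscalatingDispatcher {
    private final Executor pool;
    private final Consumer<Throwable> onCriticalFailure;

    EscalatingDispatcher(Executor pool, Consumer<Throwable> onCriticalFailure) {
        this.pool = pool;
        this.onCriticalFailure = onCriticalFailure;
    }

    void dispatch(Runnable cleanup) {
        // Failures inside the task itself are reported from whichever thread runs it.
        Runnable guarded = () -> {
            try {
                cleanup.run();
            }
            catch (Throwable e) {
                onCriticalFailure.accept(e);
            }
        };

        try {
            pool.execute(guarded);
        }
        catch (Throwable e) {
            // Failing to even submit the task (e.g. no thread can be created) is escalated too.
            onCriticalFailure.accept(e);
        }
    }
}
```

The sketch treats every Throwable as critical. If some exceptions turn out to be recoverable, the two catch blocks are the places where that distinction would have to be made.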


> Handle exceptions in PartitionReservationManager.onDoneAfterTopologyUnlock 
> properly
> -----------------------------------------------------------------------------------
>
>                 Key: IGNITE-14248
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14248
>             Project: Ignite
>          Issue Type: Improvement
>          Components: cache
>    Affects Versions: 2.9.1
>            Reporter: Vladimir Pligin
>            Assignee: Vyacheslav Koptilin
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
