[ 
https://issues.apache.org/jira/browse/IGNITE-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301591#comment-17301591
 ] 

Vyacheslav Koptilin commented on IGNITE-14248:
----------------------------------------------

Hello [~Vladimir Pligin],

> to kill the node (it's already non-functional at this point), seems to be a 
> FH job.
It seems to me that this approach is the most straightforward and preferable.

> This particular OOM situation seems unrecoverable in fact. It's an 
> environment misconfiguration.
Yep, you are right.

>  It would be great to investigate if potentially recoverable exceptions are 
> possible to be raised inside this block.
As I understand it, we should try to handle all recoverable issues inside this 
callback and not propagate them to the "exchange" thread (propagating them just 
triggers the FailureHandler).
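
A minimal sketch of that split, in plain Java rather than Ignite code. The 
class ReservationCleanupSketch, its failureHook and its recovered counter are 
hypothetical names used only for illustration; failureHook stands in for the 
FailureHandler. Treating every RuntimeException as recoverable and every Error 
as fatal is also an assumption; the real classification would need the 
investigation mentioned above.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical stand-in for the cleanup dispatch; not Ignite code. */
final class ReservationCleanupSketch {
    /** Plays the role of the management pool. */
    private final Executor managementPool;

    /** Hypothetical hook that plays the role of the FailureHandler. */
    private final Consumer<Throwable> failureHook;

    /** Number of failures that were handled locally instead of escalated. */
    private final AtomicInteger recovered = new AtomicInteger();

    ReservationCleanupSketch(Executor managementPool, Consumer<Throwable> failureHook) {
        this.managementPool = managementPool;
        this.failureHook = failureHook;
    }

    /** Dispatches the cleanup; both the dispatch and the task itself are guarded. */
    void dispatchCleanup(Runnable cleanup) {
        guard(() -> managementPool.execute(() -> guard(cleanup)));
    }

    int recoveredCount() {
        return recovered.get();
    }

    /** Handles assumed-recoverable failures in place and escalates Errors to the hook. */
    private void guard(Runnable action) {
        try {
            action.run();
        }
        catch (RuntimeException e) {
            // Assumption: runtime exceptions are recoverable, so nothing is propagated.
            recovered.incrementAndGet();
            System.err.println("Handled inside the callback: " + e);
        }
        catch (Error e) {
            // E.g. OutOfMemoryError: the node cannot continue, hand over to the hook.
            failureHook.accept(e);
        }
    }

    public static void main(String[] args) {
        Consumer<Throwable> hook = t -> System.err.println("Escalated to failure hook: " + t);

        // Pool that runs tasks inline; the cleanup itself fails with a runtime exception.
        ReservationCleanupSketch ok = new ReservationCleanupSketch(Runnable::run, hook);
        ok.dispatchCleanup(() -> { throw new IllegalStateException("reservation already released"); });

        // Pool that cannot start a thread, mimicking the OutOfMemoryError from the report.
        ReservationCleanupSketch broken = new ReservationCleanupSketch(
            task -> { throw new OutOfMemoryError("unable to create native thread"); }, hook);
        broken.dispatchCleanup(() -> { });
    }
}
```

With this shape nothing escapes to the calling thread: a runtime exception is 
counted and logged where it occurs, while an Error raised during dispatch or 
inside the task reaches the hook exactly once.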

> Handle exceptions in PartitionReservationManager.onDoneAfterTopologyUnlock 
> properly
> -----------------------------------------------------------------------------------
>
>                 Key: IGNITE-14248
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14248
>             Project: Ignite
>          Issue Type: Improvement
>          Components: cache
>    Affects Versions: 2.9.1
>            Reporter: Vladimir Pligin
>            Assignee: Vyacheslav Koptilin
>            Priority: Major
>
> If an exception (or even Error) is thrown inside of the method then the node 
> turns into some unrecoverable state. Here's an example.
>  # an exchange is about to finish, it's time to invalidate partition 
> reservations.
>  # exchange thread delegates it to a thread in the management pool
>  # management pool tries to allocate a new thread (maybe it's idle and 
> therefore empty)
>  # for example ulimit is reached, the error is 
>  java.lang.OutOfMemoryError: unable to create native thread: possibly out of 
> memory or process/resource limits reached
>  # It's being logged, no further action is taken
>  # partitions are reserved forever
> Message:
>  
> {code:java}
> 2021-02-25 05:52:03.242 [exchange-worker-#182] ERROR o.a.i.i.p.q.h.t.PartitionReservationManager - Unexpected exception on start reservations cleanup
> java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
>       at java.base/java.lang.Thread.start0(Native Method)
>       at java.base/java.lang.Thread.start(Thread.java:803)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
>       at org.apache.ignite.internal.processors.closure.GridClosureProcessor.runLocal(GridClosureProcessor.java:847)
>       at org.apache.ignite.internal.processors.query.h2.twostep.PartitionReservationManager.onDoneAfterTopologyUnlock(PartitionReservationManager.java:323)
>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:2617)
>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:159)
>       at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:475)
>       at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1064)
>       at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3375)
>       at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3194)
>       at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
>       at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
>  
>  
> Code of PartitionReservationManager.onDoneAfterTopologyUnlock:
> {code:java}
> @Override public void onDoneAfterTopologyUnlock(final GridDhtPartitionsExchangeFuture fut) {
>     try {
>         // Must not do anything at the exchange thread. Dispatch to the management thread pool.
>         ctx.closure().runLocal(() -> {
>                 AffinityTopologyVersion topVer = ctx.cache().context().exchange()
>                     .lastAffinityChangedTopologyVersion(fut.topologyVersion());
>
>                 reservations.forEach((key, r) -> {
>                     if (r != REPLICATED_RESERVABLE && !F.eq(key.topologyVersion(), topVer)) {
>                         assert r instanceof GridDhtPartitionsReservation;
>
>                         ((GridDhtPartitionsReservation)r).invalidate();
>                     }
>                 });
>             },
>             GridIoPolicy.MANAGEMENT_POOL);
>     }
>     catch (Throwable e) {
>         log.error("Unexpected exception on start reservations cleanup", e);
>     }
> }
> {code}
>  
>  
> My vision is that there are two basic approaches:
>  * to kill the node (it's already non-functional at this point), seems to be 
> a FH job.
>  * try to recover somehow (to be honest it's not clear how exactly)
> This particular OOM situation seems unrecoverable in fact. It's an 
> environment misconfiguration. It would be great to investigate if potentially 
> recoverable exceptions are possible to be raised inside this block. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
