[
https://issues.apache.org/jira/browse/IGNITE-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371884#comment-17371884
]
Vladimir Pligin commented on IGNITE-14248:
------------------------------------------
Hi [~agidaspov], sorry, that's my bad. It's been done.
> Handle exceptions in PartitionReservationManager.onDoneAfterTopologyUnlock
> properly
> -----------------------------------------------------------------------------------
>
> Key: IGNITE-14248
> URL: https://issues.apache.org/jira/browse/IGNITE-14248
> Project: Ignite
> Issue Type: Improvement
> Components: cache
> Affects Versions: 2.9.1
> Reporter: Vladimir Pligin
> Assignee: Vladimir Pligin
> Priority: Major
> Fix For: 2.11
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> If an exception (or even an Error) is thrown inside the method, the node
> ends up in an unrecoverable state. Here's an example:
> # An exchange is about to finish, so it's time to invalidate partition
> reservations.
> # The exchange thread delegates this to a thread in the management pool.
> # The management pool tries to allocate a new thread (maybe the pool is idle
> and therefore empty).
> # For example, the ulimit has been reached, so the error is
> java.lang.OutOfMemoryError: unable to create native thread: possibly out of
> memory or process/resource limits reached
> # The error is logged and no further action is taken.
> # The partitions stay reserved forever.
> Message:
>
> {code:java}
> 2021-02-25 05:52:03.242 [exchange-worker-#182] ERROR o.a.i.i.p.q.h.t.PartitionReservationManager - Unexpected exception on start reservations cleanup
> java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
>     at java.base/java.lang.Thread.start0(Native Method)
>     at java.base/java.lang.Thread.start(Thread.java:803)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
>     at org.apache.ignite.internal.processors.closure.GridClosureProcessor.runLocal(GridClosureProcessor.java:847)
>     at org.apache.ignite.internal.processors.query.h2.twostep.PartitionReservationManager.onDoneAfterTopologyUnlock(PartitionReservationManager.java:323)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:2617)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:159)
>     at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:475)
>     at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1064)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3375)
>     at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3194)
>     at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
>
>
> Code of PartitionReservationManager.onDoneAfterTopologyUnlock:
> {code:java}
> @Override public void onDoneAfterTopologyUnlock(final GridDhtPartitionsExchangeFuture fut) {
>     try {
>         // Must not do anything at the exchange thread. Dispatch to the management thread pool.
>         ctx.closure().runLocal(() -> {
>                 AffinityTopologyVersion topVer = ctx.cache().context().exchange()
>                     .lastAffinityChangedTopologyVersion(fut.topologyVersion());
>
>                 reservations.forEach((key, r) -> {
>                     if (r != REPLICATED_RESERVABLE && !F.eq(key.topologyVersion(), topVer)) {
>                         assert r instanceof GridDhtPartitionsReservation;
>
>                         ((GridDhtPartitionsReservation)r).invalidate();
>                     }
>                 });
>             },
>             GridIoPolicy.MANAGEMENT_POOL);
>     }
>     catch (Throwable e) {
>         log.error("Unexpected exception on start reservations cleanup", e);
>     }
> }
> {code}
>
>
> I see two basic approaches:
> * Kill the node (it's already non-functional at this point), which seems to
> be a job for the failure handler (FH).
> * Try to recover somehow (to be honest, it's not clear how exactly).
> This particular OOM situation in fact seems unrecoverable: it's an
> environment misconfiguration. It would be worth investigating whether any
> potentially recoverable exceptions can be raised inside this block.
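
To make the first approach concrete, here is a minimal, self-contained sketch of escalating a failed dispatch instead of only logging it. It does not use Ignite classes: `ReservationCleanupSketch`, `Reservation`, `failureHnd` and `poolThatCannotStartThreads` are hypothetical names made up for illustration, the `failureHnd` callback merely stands in for whatever the failure handler would do, and the OutOfMemoryError is simulated by a thread factory that always throws. It is not the actual fix.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Sketch only: none of these types are Ignite classes. */
class ReservationCleanupSketch {
    /** Hypothetical stand-in for a partition reservation. */
    static final class Reservation {
        private volatile boolean valid = true;

        void invalidate() { valid = false; }

        boolean isValid() { return valid; }
    }

    private final Map<String, Reservation> reservations = new ConcurrentHashMap<>();

    private final ExecutorService mgmtPool;

    /** Hypothetical callback standing in for a failure handler. */
    private final Consumer<Throwable> failureHnd;

    ReservationCleanupSketch(ExecutorService mgmtPool, Consumer<Throwable> failureHnd) {
        this.mgmtPool = mgmtPool;
        this.failureHnd = failureHnd;
    }

    Reservation reserve(String key) {
        return reservations.computeIfAbsent(key, k -> new Reservation());
    }

    /** Dispatches the cleanup to the pool and escalates a failed dispatch. */
    void onExchangeDone() {
        try {
            mgmtPool.execute(() -> reservations.values().forEach(Reservation::invalidate));
        }
        catch (Throwable e) {
            // The cleanup never ran, so stale reservations are still held.
            // Report the failure as critical instead of only logging it.
            failureHnd.accept(e);
        }
    }

    /** Pool whose thread factory always fails, simulating an exhausted thread limit. */
    static ExecutorService poolThatCannotStartThreads() {
        ThreadFactory failing = r -> {
            throw new OutOfMemoryError("unable to create native thread (simulated)");
        };

        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(), failing);
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> reported = new AtomicReference<>();
        ExecutorService pool = poolThatCannotStartThreads();

        ReservationCleanupSketch mgr = new ReservationCleanupSketch(pool, reported::set);
        Reservation r = mgr.reserve("part-0");

        mgr.onExchangeDone();

        System.out.println("reservation still valid: " + r.isValid());
        System.out.println("reported to handler: " + reported.get());

        pool.shutdownNow();
    }
}
```

In this sketch the reservation stays valid because the cleanup task never starts, which mirrors the "reserved forever" symptom; the only difference from the current code is that the error reaches a handler that can decide to stop the node.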