[
https://issues.apache.org/jira/browse/IGNITE-14248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Pligin updated IGNITE-14248:
-------------------------------------
Description:
If an exception (or even an Error) is thrown inside PartitionReservationManager.onDoneAfterTopologyUnlock, the node ends up in an unrecoverable state. Here's an example.
# an exchange is about to finish, so it's time to invalidate partition reservations
# the exchange thread delegates this to a thread in the management pool
# the management pool tries to allocate a new thread (maybe the pool is idle and therefore empty)
# thread creation fails, for example because the ulimit is reached; the error is java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
# the error is logged and no further action is taken
# the partitions stay reserved forever
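The steps above can be reproduced outside Ignite with a plain java.util.concurrent.ThreadPoolExecutor. This is only a minimal sketch, not Ignite code: the class StuckReservationDemo, the reservations map, and the failing thread factory are all hypothetical stand-ins, and the OutOfMemoryError is simulated rather than caused by a real ulimit. It shows that the error surfaces on the submitting thread, the cleanup task never runs, and the "reservation" is still held afterwards.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StuckReservationDemo {
    /** Hypothetical stand-in for the reservations map; true means "still reserved". */
    static final Map<String, Boolean> reservations = new ConcurrentHashMap<>();

    /** An idle pool whose thread factory simulates the thread limit being hit. */
    static ExecutorService poolThatCannotCreateThreads() {
        ThreadFactory failing = r -> {
            throw new OutOfMemoryError("unable to create native thread (simulated)");
        };

        return new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(), failing);
    }

    /**
     * Mirrors the log-and-continue pattern from the issue.
     *
     * @return the Throwable caught on the calling thread, or null if the task was submitted.
     */
    static Throwable dispatchCleanup(ExecutorService pool) {
        try {
            // The cleanup releases every reservation, but only if it ever gets to run.
            pool.execute(() -> reservations.replaceAll((key, reserved) -> false));

            return null;
        }
        catch (Throwable e) {
            // Same handling as in the issue: log the error and move on.
            System.err.println("Unexpected exception on start reservations cleanup: " + e);

            return e;
        }
    }

    public static void main(String[] args) {
        reservations.put("part-0", true);

        ExecutorService pool = poolThatCannotCreateThreads();
        Throwable err = dispatchCleanup(pool);

        System.out.println("caught on caller thread: " + err);
        System.out.println("reservations after failed dispatch: " + reservations);

        pool.shutdownNow();
    }
}
```

Since the pool has no worker yet, execute() has to create one, so the factory's error propagates straight to the caller before the task is queued anywhere.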
Message:
{code:java}
2021-02-25 05:52:03.242 [exchange-worker-#182] ERROR o.a.i.i.p.q.h.t.PartitionReservationManager - Unexpected exception on start reservations cleanup
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
    at java.base/java.lang.Thread.start0(Native Method)
    at java.base/java.lang.Thread.start(Thread.java:803)
    at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
    at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
    at org.apache.ignite.internal.processors.closure.GridClosureProcessor.runLocal(GridClosureProcessor.java:847)
    at org.apache.ignite.internal.processors.query.h2.twostep.PartitionReservationManager.onDoneAfterTopologyUnlock(PartitionReservationManager.java:323)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:2617)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onDone(GridDhtPartitionsExchangeFuture.java:159)
    at org.apache.ignite.internal.util.future.GridFutureAdapter.onDone(GridFutureAdapter.java:475)
    at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1064)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3375)
    at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3194)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
    at java.base/java.lang.Thread.run(Thread.java:834)
{code}
Code of PartitionReservationManager.onDoneAfterTopologyUnlock:
{code:java}
@Override public void onDoneAfterTopologyUnlock(final GridDhtPartitionsExchangeFuture fut) {
    try {
        // Must not do anything at the exchange thread. Dispatch to the management thread pool.
        ctx.closure().runLocal(() -> {
                AffinityTopologyVersion topVer = ctx.cache().context().exchange()
                    .lastAffinityChangedTopologyVersion(fut.topologyVersion());

                reservations.forEach((key, r) -> {
                    if (r != REPLICATED_RESERVABLE && !F.eq(key.topologyVersion(), topVer)) {
                        assert r instanceof GridDhtPartitionsReservation;

                        ((GridDhtPartitionsReservation)r).invalidate();
                    }
                });
            },
            GridIoPolicy.MANAGEMENT_POOL);
    }
    catch (Throwable e) {
        log.error("Unexpected exception on start reservations cleanup", e);
    }
}
{code}
As I see it, there are two basic approaches:
* kill the node (it's already non-functional at this point); this seems to be a job for the failure handler (FH)
* try to recover somehow (to be honest, it's not clear how exactly)
This particular OOM situation seems to be unrecoverable, since it's caused by an environment misconfiguration. It would be great to investigate whether potentially recoverable exceptions can be raised inside this block.
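To make the first approach more concrete, here is a rough sketch of "escalate instead of only logging". It is not the actual fix and not Ignite code: EscalatingDispatch, dispatchOrEscalate, and the criticalFailureHook callback are hypothetical names, and the hook is only a stand-in for whatever the node's failure handler would do. How this would be wired into Ignite itself is left open.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

public class EscalatingDispatch {
    /**
     * Hands the cleanup task to the pool. If the hand-off itself fails, the error is
     * logged and then passed to criticalFailureHook (a hypothetical stand-in for a
     * node failure handler) so that it is not silently swallowed.
     *
     * @return true if the task was handed to the pool, false if the failure was escalated.
     */
    static boolean dispatchOrEscalate(Executor pool, Runnable cleanup, Consumer<Throwable> criticalFailureHook) {
        try {
            pool.execute(cleanup);

            return true;
        }
        catch (Throwable e) {
            System.err.println("Failed to start reservations cleanup: " + e);

            // The cleanup will never run, so let the hook decide what happens to the node.
            criticalFailureHook.accept(e);

            return false;
        }
    }

    public static void main(String[] args) {
        AtomicReference<Throwable> reported = new AtomicReference<>();

        // Simulated pool that cannot start a thread for the task.
        Executor broken = task -> {
            throw new OutOfMemoryError("unable to create native thread (simulated)");
        };

        boolean dispatched = dispatchOrEscalate(broken, () -> { }, reported::set);

        System.out.println("dispatched=" + dispatched + ", escalated=" + reported.get());
    }
}
```

The sketch escalates every Throwable the same way. Whether some exceptions from this block deserve a softer reaction is exactly the open question above.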
> Handle exceptions in PartitionReservationManager.onDoneAfterTopologyUnlock
> properly
> -----------------------------------------------------------------------------------
>
> Key: IGNITE-14248
> URL: https://issues.apache.org/jira/browse/IGNITE-14248
> Project: Ignite
> Issue Type: Improvement
> Components: cache
> Affects Versions: 2.9.1
> Reporter: Vladimir Pligin
> Assignee: Vyacheslav Koptilin
> Priority: Major
>