[
https://issues.apache.org/jira/browse/KARAF-7969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vineeth updated KARAF-7969:
---------------------------
Description:
Hello
I have an AIX 7.3 system where I’m able to replicate the issue using IBM J9
Java 11
>From the Thread dump , i am able to locate one issue in
>*org.apache.karaf.main.Main* class in
{code:java}
public void awaitShutdown() throws Exception { if (framework == null) { return;
} while (true) { FrameworkEvent event = framework.waitForStop(0); if
(event.getType() == FrameworkEvent.STOPPED_UPDATE) { if (lock != null) {
lock.release(); } while (framework.getState() != Bundle.STARTING &&
framework.getState() != Bundle.ACTIVE) { Thread.sleep(10); } monitorThread =
monitor(); } else { return; } } }
{code}
The main thread calls {{framework.waitForStop(0)}} with a timeout of 0, meaning
it will wait indefinitely
This calls into ThreadGate.await() in the Felix framework, which uses
Object.wait() to block until notified
The parameter 0 means "wait forever" there's no timeout safety mechanism,
correct?For the main thread to unblock, the ThreadGate object needs to receive
a notification via notify() or notifyAll()
This notification should come from the Felix framework when it completes
shutdown
However, the Felix framework threads (FelixDispatchQueue, FelixFrameworkWiring,
FelixStartLevel) are all waiting themselves.
_*+Here it should have set some timeout value something like 30
seconds(30000).?, What you think? Please feel free to correct me.+*_
In thread dump,
3XMTHREADINFO "main" J9VMThread:0x0000000030010700,
omrthread_t:0x00000100100B2CE0, java/lang/Thread:0x00000000F007C2D0, state:CW,
prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x1, isDaemon:false)
3XMJAVALTHRCCL
jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000F0075810)
3XMTHREADINFO1 (native thread ID:0x1320327, native priority:0x5,
native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000181)
3XMTHREADINFO2 (native stack address range from:0x0000010010017D90,
to:0x000001001009D788, size:0x859F8)
3XMCPUTIME CPU usage total: 0.449041000 secs, user: 0.373210000
secs, system: 0.075831000 secs, current category="Application"3XMTHREADBLOCK
Waiting on: org/apache/felix/framework/util/ThreadGate@0x00000000F05AF2D8
Owned by: <unowned>
3XMHEAPALLOC Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3 Java callstack:
4XESTACKTRACE at java/lang/Object.waitImpl(Native Method)
4XESTACKTRACE at java/lang/Object.wait(Object.java:251)
4XESTACKTRACE at java/lang/Object.wait(Object.java:219)
4XESTACKTRACE at
org/apache/felix/framework/util/ThreadGate.await(ThreadGate.java:79)
4XESTACKTRACE at
org/apache/felix/framework/Felix.waitForStop(Felix.java:1075)
4XESTACKTRACE at
org/apache/karaf/main/Main.awaitShutdown(Main.java:671)
4XESTACKTRACE at org/apache/karaf/main/Main.main(Main.java:190)
3XMTHREADINFO3 Native callstack:
4XENATIVESTACK _event_wait+0x2c (0x09000000006AF470
[libpthreads.a+0x19470])
4XENATIVESTACK _cond_wait_local+0x2e4 (0x09000000006B75E8
[libpthreads.a+0x215e8])
4XENATIVESTACK _cond_wait+0x34 (0x09000000006B7D38
[libpthreads.a+0x21d38])
4XENATIVESTACK pthread_cond_wait+0x1a8 (0x09000000006B880C
[libpthreads.a+0x2280c])
4XENATIVESTACK IPRA.$monitor_wait_original+0xa10
(0x0900000000D88CD4 [libj9thr29.so+0x9cd4])
4XENATIVESTACK omrthread_monitor_wait_interruptable+0x50
(0x0900000000D89474 [libj9thr29.so+0xa474])
4XENATIVESTACK monitorWaitImpl+0x488 (0x090000000D84668C
[libj9vm29.so+0x8168c])
4XENATIVESTACK (0x090000000D9BC924 [libj9vm29.so+0x1f7924])
4XENATIVESTACK (0x090000000D8A9168 [libj9vm29.so+0xe4168])
4XENATIVESTACK runCallInMethod+0x2d0 (0x090000000D8073F4
[libj9vm29.so+0x423f4])
4XENATIVESTACK _ZL26gpProtectedRunCallInMethodPv+0x4c
(0x090000000D7C66F0 [libj9vm29.so+0x16f0])
4XENATIVESTACK signalProtectAndRunGlue+0x28 (0x090000000D822A8C
[libj9vm29.so+0x5da8c])
4XENATIVESTACK omrsig_protect+0x4fc (0x090000000D555760
[libj9prt29.so+0x5f760])
4XENATIVESTACK gpProtectAndRun+0xf0 (0x090000000D8227F4
[libj9vm29.so+0x5d7f4])
4XENATIVESTACK gpCheckCallin+0x118 (0x090000000D7C665C
[libj9vm29.so+0x165c])
4XENATIVESTACK callStaticVoidMethod+0x44 (0x090000000D8E9728
[libj9vm29.so+0x124728])
4XENATIVESTACK JavaMain+0xc14 (0x0000010000007258 [java+0x7258])
4XENATIVESTACK ThreadJavaMain+0xc (0x000001000000DCB0
[java+0xdcb0])
4XENATIVESTACK (0x0900000000089214 [libpthreads.a+0x214])
I'm not entirely sure how everything is linked together, but it seems to form a
larger race condition, which eventually leads to a hang state in the Karaf
bundles. This issue can be easily reproduced in server mode.
Simply use the {{/bin/start}} script, monitor the status(/bin/status), and once
the server has started completely, run the {{/bin/stop}} script in a loop.
After some iteration the system enters a hang state. I managed to capture
thread and heap dumps during this, which gave me some insights that I’ve shared
here.
Thanks
was:
Hello
I have an AIX 7.3 system where I’m able to replicate the issue using IBM J9
Java 11
>From the Thread dump , i am able to locate one issue in
>*org.apache.karaf.main.Main* class in
public void awaitShutdown() throws Exception { if (framework == null) {
return;
} while (true) {
FrameworkEvent event = framework.waitForStop(0); if
(event.getType() == FrameworkEvent.STOPPED_UPDATE) { if (lock !=
null) \{
lock.release();
} while (framework.getState() != Bundle.STARTING &&
framework.getState() != Bundle.ACTIVE) { Thread.sleep(10);
}
monitorThread = monitor();
} else { return;
}
}
}
The main thread calls {{framework.waitForStop(0)}} with a timeout of 0, meaning
it will wait indefinitely
This calls into ThreadGate.await() in the Felix framework, which uses
Object.wait() to block until notified
The parameter 0 means "wait forever" there's no timeout safety mechanism,
correct?For the main thread to unblock, the ThreadGate object needs to receive
a notification via notify() or notifyAll()
This notification should come from the Felix framework when it completes
shutdown
However, the Felix framework threads (FelixDispatchQueue, FelixFrameworkWiring,
FelixStartLevel) are all waiting themselves.
_*+Here it should have set some timeout value something like 30
seconds(30000).?, What you think? Please feel free to correct me.+*_
In thread dump,
3XMTHREADINFO "main" J9VMThread:0x0000000030010700,
omrthread_t:0x00000100100B2CE0, java/lang/Thread:0x00000000F007C2D0, state:CW,
prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x1, isDaemon:false)
3XMJAVALTHRCCL
jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000F0075810)
3XMTHREADINFO1 (native thread ID:0x1320327, native priority:0x5,
native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000181)
3XMTHREADINFO2 (native stack address range from:0x0000010010017D90,
to:0x000001001009D788, size:0x859F8)
3XMCPUTIME CPU usage total: 0.449041000 secs, user: 0.373210000
secs, system: 0.075831000 secs, current category="Application"3XMTHREADBLOCK
Waiting on: org/apache/felix/framework/util/ThreadGate@0x00000000F05AF2D8
Owned by: <unowned>
3XMHEAPALLOC Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3 Java callstack:
4XESTACKTRACE at java/lang/Object.waitImpl(Native Method)
4XESTACKTRACE at java/lang/Object.wait(Object.java:251)
4XESTACKTRACE at java/lang/Object.wait(Object.java:219)
4XESTACKTRACE at
org/apache/felix/framework/util/ThreadGate.await(ThreadGate.java:79)
4XESTACKTRACE at
org/apache/felix/framework/Felix.waitForStop(Felix.java:1075)
4XESTACKTRACE at
org/apache/karaf/main/Main.awaitShutdown(Main.java:671)
4XESTACKTRACE at org/apache/karaf/main/Main.main(Main.java:190)
3XMTHREADINFO3 Native callstack:
4XENATIVESTACK _event_wait+0x2c (0x09000000006AF470
[libpthreads.a+0x19470])
4XENATIVESTACK _cond_wait_local+0x2e4 (0x09000000006B75E8
[libpthreads.a+0x215e8])
4XENATIVESTACK _cond_wait+0x34 (0x09000000006B7D38
[libpthreads.a+0x21d38])
4XENATIVESTACK pthread_cond_wait+0x1a8 (0x09000000006B880C
[libpthreads.a+0x2280c])
4XENATIVESTACK IPRA.$monitor_wait_original+0xa10
(0x0900000000D88CD4 [libj9thr29.so+0x9cd4])
4XENATIVESTACK omrthread_monitor_wait_interruptable+0x50
(0x0900000000D89474 [libj9thr29.so+0xa474])
4XENATIVESTACK monitorWaitImpl+0x488 (0x090000000D84668C
[libj9vm29.so+0x8168c])
4XENATIVESTACK (0x090000000D9BC924 [libj9vm29.so+0x1f7924])
4XENATIVESTACK (0x090000000D8A9168 [libj9vm29.so+0xe4168])
4XENATIVESTACK runCallInMethod+0x2d0 (0x090000000D8073F4
[libj9vm29.so+0x423f4])
4XENATIVESTACK _ZL26gpProtectedRunCallInMethodPv+0x4c
(0x090000000D7C66F0 [libj9vm29.so+0x16f0])
4XENATIVESTACK signalProtectAndRunGlue+0x28 (0x090000000D822A8C
[libj9vm29.so+0x5da8c])
4XENATIVESTACK omrsig_protect+0x4fc (0x090000000D555760
[libj9prt29.so+0x5f760])
4XENATIVESTACK gpProtectAndRun+0xf0 (0x090000000D8227F4
[libj9vm29.so+0x5d7f4])
4XENATIVESTACK gpCheckCallin+0x118 (0x090000000D7C665C
[libj9vm29.so+0x165c])
4XENATIVESTACK callStaticVoidMethod+0x44 (0x090000000D8E9728
[libj9vm29.so+0x124728])
4XENATIVESTACK JavaMain+0xc14 (0x0000010000007258 [java+0x7258])
4XENATIVESTACK ThreadJavaMain+0xc (0x000001000000DCB0
[java+0xdcb0])
4XENATIVESTACK (0x0900000000089214 [libpthreads.a+0x214])
I'm not entirely sure how everything is linked together, but it seems to form a
larger race condition, which eventually leads to a hang state in the Karaf
bundles. This issue can be easily reproduced in server mode.
Simply use the {{/bin/start}} script, monitor the status(/bin/status), and once
the server has started completely, run the {{/bin/stop}} script in a loop.
After some iteration the system enters a hang state. I managed to capture
thread and heap dumps during this, which gave me some insights that I’ve shared
here.
Thanks
> Apache Karaf bundles entered a hang state due to a thread wait
> --------------------------------------------------------------
>
> Key: KARAF-7969
> URL: https://issues.apache.org/jira/browse/KARAF-7969
> Project: Karaf
> Issue Type: Bug
> Components: karaf
> Affects Versions: 4.2.16
> Reporter: Vineeth
> Priority: Major
>
> Hello
> I have an AIX 7.3 system where I’m able to replicate the issue using IBM J9
> Java 11
> From the Thread dump , i am able to locate one issue in
> *org.apache.karaf.main.Main* class in
> {code:java}
> public void awaitShutdown() throws Exception { if (framework == null) {
> return; } while (true) { FrameworkEvent event = framework.waitForStop(0); if
> (event.getType() == FrameworkEvent.STOPPED_UPDATE) { if (lock != null) {
> lock.release(); } while (framework.getState() != Bundle.STARTING &&
> framework.getState() != Bundle.ACTIVE) { Thread.sleep(10); } monitorThread =
> monitor(); } else { return; } } }
> {code}
>
> The main thread calls {{framework.waitForStop(0)}} with a timeout of 0,
> meaning it will wait indefinitely
> This calls into ThreadGate.await() in the Felix framework, which uses
> Object.wait() to block until notified
> The parameter 0 means "wait forever" there's no timeout safety mechanism,
> correct?For the main thread to unblock, the ThreadGate object needs to
> receive a notification via notify() or notifyAll()
> This notification should come from the Felix framework when it completes
> shutdown
> However, the Felix framework threads (FelixDispatchQueue,
> FelixFrameworkWiring, FelixStartLevel) are all waiting themselves.
> _*+Here it should have set some timeout value something like 30
> seconds(30000).?, What you think? Please feel free to correct me.+*_
> In thread dump,
> 3XMTHREADINFO "main" J9VMThread:0x0000000030010700,
> omrthread_t:0x00000100100B2CE0, java/lang/Thread:0x00000000F007C2D0,
> state:CW, prio=5
> 3XMJAVALTHREAD (java/lang/Thread getId:0x1, isDaemon:false)
> 3XMJAVALTHRCCL
> jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000F0075810)
> 3XMTHREADINFO1 (native thread ID:0x1320327, native priority:0x5,
> native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000181)
> 3XMTHREADINFO2 (native stack address range
> from:0x0000010010017D90, to:0x000001001009D788, size:0x859F8)
> 3XMCPUTIME CPU usage total: 0.449041000 secs, user: 0.373210000
> secs, system: 0.075831000 secs, current category="Application"3XMTHREADBLOCK
> Waiting on: org/apache/felix/framework/util/ThreadGate@0x00000000F05AF2D8
> Owned by: <unowned>
> 3XMHEAPALLOC Heap bytes allocated since last GC cycle=0 (0x0)
> 3XMTHREADINFO3 Java callstack:
> 4XESTACKTRACE at java/lang/Object.waitImpl(Native Method)
> 4XESTACKTRACE at java/lang/Object.wait(Object.java:251)
> 4XESTACKTRACE at java/lang/Object.wait(Object.java:219)
> 4XESTACKTRACE at
> org/apache/felix/framework/util/ThreadGate.await(ThreadGate.java:79)
> 4XESTACKTRACE at
> org/apache/felix/framework/Felix.waitForStop(Felix.java:1075)
> 4XESTACKTRACE at
> org/apache/karaf/main/Main.awaitShutdown(Main.java:671)
> 4XESTACKTRACE at org/apache/karaf/main/Main.main(Main.java:190)
> 3XMTHREADINFO3 Native callstack:
> 4XENATIVESTACK _event_wait+0x2c (0x09000000006AF470
> [libpthreads.a+0x19470])
> 4XENATIVESTACK _cond_wait_local+0x2e4 (0x09000000006B75E8
> [libpthreads.a+0x215e8])
> 4XENATIVESTACK _cond_wait+0x34 (0x09000000006B7D38
> [libpthreads.a+0x21d38])
> 4XENATIVESTACK pthread_cond_wait+0x1a8 (0x09000000006B880C
> [libpthreads.a+0x2280c])
> 4XENATIVESTACK IPRA.$monitor_wait_original+0xa10
> (0x0900000000D88CD4 [libj9thr29.so+0x9cd4])
> 4XENATIVESTACK omrthread_monitor_wait_interruptable+0x50
> (0x0900000000D89474 [libj9thr29.so+0xa474])
> 4XENATIVESTACK monitorWaitImpl+0x488 (0x090000000D84668C
> [libj9vm29.so+0x8168c])
> 4XENATIVESTACK (0x090000000D9BC924 [libj9vm29.so+0x1f7924])
> 4XENATIVESTACK (0x090000000D8A9168 [libj9vm29.so+0xe4168])
> 4XENATIVESTACK runCallInMethod+0x2d0 (0x090000000D8073F4
> [libj9vm29.so+0x423f4])
> 4XENATIVESTACK _ZL26gpProtectedRunCallInMethodPv+0x4c
> (0x090000000D7C66F0 [libj9vm29.so+0x16f0])
> 4XENATIVESTACK signalProtectAndRunGlue+0x28 (0x090000000D822A8C
> [libj9vm29.so+0x5da8c])
> 4XENATIVESTACK omrsig_protect+0x4fc (0x090000000D555760
> [libj9prt29.so+0x5f760])
> 4XENATIVESTACK gpProtectAndRun+0xf0 (0x090000000D8227F4
> [libj9vm29.so+0x5d7f4])
> 4XENATIVESTACK gpCheckCallin+0x118 (0x090000000D7C665C
> [libj9vm29.so+0x165c])
> 4XENATIVESTACK callStaticVoidMethod+0x44 (0x090000000D8E9728
> [libj9vm29.so+0x124728])
> 4XENATIVESTACK JavaMain+0xc14 (0x0000010000007258 [java+0x7258])
> 4XENATIVESTACK ThreadJavaMain+0xc (0x000001000000DCB0
> [java+0xdcb0])
> 4XENATIVESTACK (0x0900000000089214 [libpthreads.a+0x214])
> I'm not entirely sure how everything is linked together, but it seems to form
> a larger race condition, which eventually leads to a hang state in the Karaf
> bundles. This issue can be easily reproduced in server mode.
> Simply use the {{/bin/start}} script, monitor the status(/bin/status), and
> once the server has started completely, run the {{/bin/stop}} script in a
> loop. After some iteration the system enters a hang state. I managed to
> capture thread and heap dumps during this, which gave me some insights that
> I’ve shared here.
> Thanks
--
This message was sent by Atlassian Jira
(v8.20.10#820010)