[ https://issues.apache.org/jira/browse/KARAF-7948 ]


    Vineeth deleted comment on KARAF-7948:
    --------------------------------

was (Author: JIRAUSER309301):
Hi [~jbonofre] ,
Thanks for checking the issue. 
Just adding a few more points to this case I have an AIX 7.3 system where I’m 
able to replicate the issue using IBM J9 Java 11

>From the Thread dump , i am able to locate one issue in 
>*org.apache.karaf.main.Main* class in 
{code:java}
public void awaitShutdown() throws Exception {
    if (framework == null) {
        return;
    }
    while (true) {
        FrameworkEvent event = framework.waitForStop(0);
        if (event.getType() == FrameworkEvent.STOPPED_UPDATE) {
            if (lock != null) {
                lock.release();
            }
            while (framework.getState() != Bundle.STARTING && 
framework.getState() != Bundle.ACTIVE) {
                Thread.sleep(10);
            }
            monitorThread = monitor();
        } else {
            return;
        }
    }
}{code}

The main thread calls {{framework.waitForStop(0)}} with a timeout of 0, meaning 
it will wait indefinitely
This calls into ThreadGate.await() in the Felix framework, which uses 
Object.wait() to block until notified
The parameter 0 means "wait forever"  there's no timeout safety mechanism, 
correct?For the main thread to unblock, the ThreadGate object needs to receive 
a notification via notify() or notifyAll()
This notification should come from the Felix framework when it completes 
shutdown
However, the Felix framework threads (FelixDispatchQueue, FelixFrameworkWiring, 
FelixStartLevel) are all waiting themselves.

_*+Here it should have set some timeout value something like 30 
seconds(30000).?, What you think? Please feel free to correct me.+*_

In thread dump,


{code:java}
3XMTHREADINFO      "main" J9VMThread:0x0000000030010700, 
omrthread_t:0x00000100100B2CE0, java/lang/Thread:0x00000000F007C2D0, state:CW, 
prio=5
3XMJAVALTHREAD            (java/lang/Thread getId:0x1, isDaemon:false)
3XMJAVALTHRCCL            
jdk/internal/loader/ClassLoaders$AppClassLoader(0x00000000F0075810)
3XMTHREADINFO1            (native thread ID:0x1320327, native priority:0x5, 
native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000181)
3XMTHREADINFO2            (native stack address range from:0x0000010010017D90, 
to:0x000001001009D788, size:0x859F8)
3XMCPUTIME               CPU usage total: 0.449041000 secs, user: 0.373210000 
secs, system: 0.075831000 secs, current category="Application"
3XMTHREADBLOCK     Waiting on: 
org/apache/felix/framework/util/ThreadGate@0x00000000F05AF2D8 Owned by: 
<unowned>
3XMHEAPALLOC             Heap bytes allocated since last GC cycle=0 (0x0)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/lang/Object.waitImpl(Native Method)
4XESTACKTRACE                at java/lang/Object.wait(Object.java:251)
4XESTACKTRACE                at java/lang/Object.wait(Object.java:219)
4XESTACKTRACE                at 
org/apache/felix/framework/util/ThreadGate.await(ThreadGate.java:79)
4XESTACKTRACE                at 
org/apache/felix/framework/Felix.waitForStop(Felix.java:1075)
4XESTACKTRACE                at 
org/apache/karaf/main/Main.awaitShutdown(Main.java:671)
4XESTACKTRACE                at org/apache/karaf/main/Main.main(Main.java:190)
3XMTHREADINFO3           Native callstack:
4XENATIVESTACK               _event_wait+0x2c (0x09000000006AF470 
[libpthreads.a+0x19470])
4XENATIVESTACK               _cond_wait_local+0x2e4 (0x09000000006B75E8 
[libpthreads.a+0x215e8])
4XENATIVESTACK               _cond_wait+0x34 (0x09000000006B7D38 
[libpthreads.a+0x21d38])
4XENATIVESTACK               pthread_cond_wait+0x1a8 (0x09000000006B880C 
[libpthreads.a+0x2280c])
4XENATIVESTACK               IPRA.$monitor_wait_original+0xa10 
(0x0900000000D88CD4 [libj9thr29.so+0x9cd4])
4XENATIVESTACK               omrthread_monitor_wait_interruptable+0x50 
(0x0900000000D89474 [libj9thr29.so+0xa474])
4XENATIVESTACK               monitorWaitImpl+0x488 (0x090000000D84668C 
[libj9vm29.so+0x8168c])
4XENATIVESTACK               (0x090000000D9BC924 [libj9vm29.so+0x1f7924])
4XENATIVESTACK               (0x090000000D8A9168 [libj9vm29.so+0xe4168])
4XENATIVESTACK               runCallInMethod+0x2d0 (0x090000000D8073F4 
[libj9vm29.so+0x423f4])
4XENATIVESTACK               _ZL26gpProtectedRunCallInMethodPv+0x4c 
(0x090000000D7C66F0 [libj9vm29.so+0x16f0])
4XENATIVESTACK               signalProtectAndRunGlue+0x28 (0x090000000D822A8C 
[libj9vm29.so+0x5da8c])
4XENATIVESTACK               omrsig_protect+0x4fc (0x090000000D555760 
[libj9prt29.so+0x5f760])
4XENATIVESTACK               gpProtectAndRun+0xf0 (0x090000000D8227F4 
[libj9vm29.so+0x5d7f4])
4XENATIVESTACK               gpCheckCallin+0x118 (0x090000000D7C665C 
[libj9vm29.so+0x165c])
4XENATIVESTACK               callStaticVoidMethod+0x44 (0x090000000D8E9728 
[libj9vm29.so+0x124728])
4XENATIVESTACK               JavaMain+0xc14 (0x0000010000007258 [java+0x7258])
4XENATIVESTACK               ThreadJavaMain+0xc (0x000001000000DCB0 
[java+0xdcb0])
4XENATIVESTACK               (0x0900000000089214 [libpthreads.a+0x214]){code}
I'm not entirely sure how everything is linked together, but it seems to form a 
larger race condition, which eventually leads to a hang state in the Karaf 
bundles. This issue can be easily reproduced in server mode.

Simply use the {{/bin/start}} script, monitor the status(/bin/status), and once 
the server has started completely, run the {{/bin/stop}} script in a loop. 
After some iteration the system enters a hang state. I managed to capture 
thread and heap dumps during this, which gave me some insights that I’ve shared 
here.

Thanks

> Apache karaf got into stuck state
> ---------------------------------
>
>                 Key: KARAF-7948
>                 URL: https://issues.apache.org/jira/browse/KARAF-7948
>             Project: Karaf
>          Issue Type: Bug
>          Components: karaf
>    Affects Versions: 4.2.16
>            Reporter: Vineeth
>            Assignee: Jean-Baptiste Onofré
>            Priority: Major
>
> We have a customer case where Apache Karaf(agent) is installed as a *systemd* 
> [service|https://github.com/apache/karaf/blob/main/assemblies/features/base/src/main/resources/resources/bin/contrib/karaf-service-template.systemd].
>  When the virtual machine (VM) is restarted, where Apache Karaf is set up as 
> a service, the Karaf process sometimes enters a {*}stuck state{*}. Although 
> the PID is still running, Karaf becomes completely unresponsive.
> There is no definite pattern — it happens intermittently. We investigated 
> multiple heap dumps but couldn't find any clear clues. However, in one 
> instance, we observed *two instances* of the {{FeaturesServiceImpl}} class, 
> which suggests that the Activator may have initialized 
> {{FeaturesServiceImpl}} {*}twice{*}, even though this only occurred once 
> during our tests.
> Later, we checked the state of each bundle. We noticed that the 
> mvn:org.apache.karaf.features/org.apache.karaf.features.extension/4.2.16 
> bundle and the mvn:org.ops4j.pax.logging/pax-logging-log4j2-extra/1.11.15 
> bundle were in *state 2* (installed), while all other bundles were in *state 
> 32* (active). This points to a possible *race condition* or a low-level crash 
> occurring during startup.
> We were able to reproduce the issue by looping a restart of the systemd 
> service using {{systemctl restart agent.service}} every 4 minutes. However, 
> we have been unable to pinpoint the exact cause of the problem.
> Please review this and let us know if you can help us further diagnose the 
> issue.
> Details
> Customer uses Suse Linux on z , installed as rpm package using zypper
> Apache karaf version: 4.2.16
> Log : There no log wriiten after startup. It will simply writes nothing.
> We can share more details, if needed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to