Re: Figure out what killed an app (rhbz#2253099)

2024-02-08 Thread Yanko Kaneti
On Mon, 2024-02-05 at 16:20 +0200, Yanko Kaneti wrote:
> On Mon, 2024-02-05 at 14:31 +0100, Florian Weimer wrote:
> > > 
> > > Running with rtkit-daemon stoped and masked prevents the main evo
> > > process being prioritized and consequently killed with higher load.
> > > 
> > > The fact that the kernel kills the (main)thread without a beep anywhere
> > > in the logs is strange to me.
> > 
> > I think it's not so unusual for the kernel not to log the SIGKILL.  This
> > does not seem to be purely an emergency measure like the OOM killer—it's
> > more like something that an application (set of processes) told the
> > kernel to do to itself.  As such, it's not unreasonable for the
> > application to do the logging.  On the other hand, I think it's
> > difficult for rtkit-daemon to arrange for that logging because it can't
> > monitor the target process directly.
> 
> Logical but leaves users like me scratching there head with mishaps like
> these. Given that print-fatal-signals kernel param (which I just learned
> about) is off by default.

FWIW, I've been running for the last couple days  with print-fatal-
signals=1 and the only additional logs I get are exactly the ones that
would've helped in this case.

kernel: RT Watchdog Timeout (hard): evolution[4604]
kernel: RT Watchdog Timeout (hard): liferea[5045]

i.e. RT watchdog kills do seem like an out-of-the-ordinary events and
there is case to be made that logging in should be on by default.

-Yanko

> 
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-02-05 Thread Yanko Kaneti
On Mon, 2024-02-05 at 14:31 +0100, Florian Weimer wrote:
> * Yanko Kaneti:
> 
> > On Sat, 2024-02-03 at 13:29 +0100, Ondrej Mosnáček wrote:
> > > On Fri, 2 Feb 2024 at 17:52, Yanko Kaneti  wrote:
> > > > 
> > > > On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
> > > > > On Thu, 1 Feb 2024 at 09:13, Milan Crha  wrote:
> > > > > > The kernel tracing log for sig==9 shows:
> > > > > > 
> > > > > > gnome-terminal--2924[002] dN.2.  2520.462889: signal_generate:
> > > > > > sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
> > > > > > 
> > > > > > There is no such thing (apart of the tracing log) when Evolution is
> > > > > > suddenly killed, the logs are muted. That makes me believe it's not 
> > > > > > the
> > > > > > OOM killer whom kills the application. I'm back at square 1; or 
> > > > > > maybe
> > > > > > square 2, as I know possibly kernel sends it, but not why.
> > > > > 
> > > > > Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as
> > > > > root) and then collect the kernel trace again. That should give you a
> > > > > backtrace of kernel functions from the signal generation, which could
> > > > > help you/us to figure out the reason the process was killed.
> > > > 
> > > > So, figured the easiest way to help trigger the kill here is to put load
> > > > on the machine.
> > > > 
> > > >  $ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95
> > > > 
> > > > then starting evolution seems to do it almost every time shortly after
> > > > start (I have around 200k messages in the folder its trying to display)
> > > > 
> > > > I've enabled the stacktrace tracing option and like Milan set a sig==9
> > > > filter. And here is what I got in the trace buffer after it was killed
> > > > 
> > > > # tracer: nop
> > > > #
> > > > # entries-in-buffer/entries-written: 34/34   #P:16
> > > > #
> > > > #_-=> irqs-off/BH-disabled
> > > > #   / _=> need-resched
> > > > #  | / _---=> hardirq/softirq
> > > > #  || / _--=> preempt-depth
> > > > #  ||| / _-=> migrate-disable
> > > > #   / delay
> > > > #   TASK-PID CPU#  |  TIMESTAMP  FUNCTION
> > > > #  | | |   | | |
> > > >evolution-9096[002] d..1.  1207.016489: signal_generate: 
> > > > sig=9 errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
> > > >evolution-9096[002] d..1.  1207.016495: 
> > > >  => trace_event_raw_event_signal_generate
> > > >  => __send_signal_locked
> > > >  => posix_cpu_timers_work
> > > >  => task_work_run
> > > >  => irqentry_exit_to_user_mode
> > > >  => asm_sysvec_apic_timer_interrupt
> > > 
> > > So, browsing through the relevant kernel code, it seems the only cases
> > > which could have led to this backtrace are:
> > > 1. When a task's RT timeout goes over the RLIMIT_RTTIME hard limit
> > > (see function check_thread_timers() in
> > > kernel/time/posix-cpu-timers.c).
> > > 2. When a task's CPU time goes over the RLIMIT_CPU hard limit (see
> > > function check_process_timers() in kernel/time/posix-cpu-timers.c).
> > > 
> > > I may have missed some code path, but these resource limits should be
> > > the next thing to check.
> > 
> > Thanks a lot for diving it. This is probably it.
> > The kernel killing the main evolution process which was set as having rt
> > priority by some webkit coincidence.
> > 
> > Running with rtkit-daemon stoped and masked prevents the main evo
> > process being prioritized and consequently killed with higher load.
> > 
> > The fact that the kernel kills the (main)thread without a beep anywhere
> > in the logs is strange to me.
> 
> Interesting analysis, thanks!
> 
> I think it's not so unusual for the kernel not to log the SIGKILL.  This
> does not seem to be purely an emergency measure like the OOM killer—it's
> more like something that an application (set of processes) told the
> kernel to do to itself.  As such, it's not unreasonable for the
> application to do the logging.  On the other hand, I think it's
> difficult for rtkit-daemon to arrange for that logging because it can't
> monitor the target process directly.

Logical but leaves users like me scratching there head with mishaps like
these. Given that print-fatal-signals kernel param (which I just learned
about) is off by default.

> 
> Any idea why the signal is SIGKILL and not SIGXCPU?  To ensure that the
> process actually terminates promptly, without running signal handlers or
> generating a coredump?  SIGXCPU would point more directly to the cause.

No idea. I just saw kernel/time/poxis-cpu-timers.c after Ondrej
mentioned it. SIGXCPU for soft limits and KILL for hard it seems.


- Yanko
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to 

Re: Figure out what killed an app (rhbz#2253099)

2024-02-05 Thread Florian Weimer
* Yanko Kaneti:

> On Sat, 2024-02-03 at 13:29 +0100, Ondrej Mosnáček wrote:
>> On Fri, 2 Feb 2024 at 17:52, Yanko Kaneti  wrote:
>> > 
>> > On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
>> > > On Thu, 1 Feb 2024 at 09:13, Milan Crha  wrote:
>> > > > The kernel tracing log for sig==9 shows:
>> > > > 
>> > > > gnome-terminal--2924[002] dN.2.  2520.462889: signal_generate:
>> > > > sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
>> > > > 
>> > > > There is no such thing (apart of the tracing log) when Evolution is
>> > > > suddenly killed, the logs are muted. That makes me believe it's not the
>> > > > OOM killer whom kills the application. I'm back at square 1; or maybe
>> > > > square 2, as I know possibly kernel sends it, but not why.
>> > > 
>> > > Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as
>> > > root) and then collect the kernel trace again. That should give you a
>> > > backtrace of kernel functions from the signal generation, which could
>> > > help you/us to figure out the reason the process was killed.
>> > 
>> > So, figured the easiest way to help trigger the kill here is to put load
>> > on the machine.
>> > 
>> >  $ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95
>> > 
>> > then starting evolution seems to do it almost every time shortly after
>> > start (I have around 200k messages in the folder its trying to display)
>> > 
>> > I've enabled the stacktrace tracing option and like Milan set a sig==9
>> > filter. And here is what I got in the trace buffer after it was killed
>> > 
>> > # tracer: nop
>> > #
>> > # entries-in-buffer/entries-written: 34/34   #P:16
>> > #
>> > #_-=> irqs-off/BH-disabled
>> > #   / _=> need-resched
>> > #  | / _---=> hardirq/softirq
>> > #  || / _--=> preempt-depth
>> > #  ||| / _-=> migrate-disable
>> > #   / delay
>> > #   TASK-PID CPU#  |  TIMESTAMP  FUNCTION
>> > #  | | |   | | |
>> >evolution-9096[002] d..1.  1207.016489: signal_generate: sig=9 
>> > errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
>> >evolution-9096[002] d..1.  1207.016495: 
>> >  => trace_event_raw_event_signal_generate
>> >  => __send_signal_locked
>> >  => posix_cpu_timers_work
>> >  => task_work_run
>> >  => irqentry_exit_to_user_mode
>> >  => asm_sysvec_apic_timer_interrupt
>> 
>> So, browsing through the relevant kernel code, it seems the only cases
>> which could have led to this backtrace are:
>> 1. When a task's RT timeout goes over the RLIMIT_RTTIME hard limit
>> (see function check_thread_timers() in
>> kernel/time/posix-cpu-timers.c).
>> 2. When a task's CPU time goes over the RLIMIT_CPU hard limit (see
>> function check_process_timers() in kernel/time/posix-cpu-timers.c).
>> 
>> I may have missed some code path, but these resource limits should be
>> the next thing to check.
>
> Thanks a lot for diving it. This is probably it.
> The kernel killing the main evolution process which was set as having rt
> priority by some webkit coincidence.
>
> Running with rtkit-daemon stoped and masked prevents the main evo
> process being prioritized and consequently killed with higher load.
>
> The fact that the kernel kills the (main)thread without a beep anywhere
> in the logs is strange to me.

Interesting analysis, thanks!

I think it's not so unusual for the kernel not to log the SIGKILL.  This
does not seem to be purely an emergency measure like the OOM killer—it's
more like something that an application (set of processes) told the
kernel to do to itself.  As such, it's not unreasonable for the
application to do the logging.  On the other hand, I think it's
difficult for rtkit-daemon to arrange for that logging because it can't
monitor the target process directly.

Any idea why the signal is SIGKILL and not SIGXCPU?  To ensure that the
process actually terminates promptly, without running signal handlers or
generating a coredump?  SIGXCPU would point more directly to the cause.

Thanks,
Florian
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-02-04 Thread Milan Crha
On Sun, 2024-02-04 at 14:03 +0200, Yanko Kaneti wrote:
> The kernel killing the main evolution process which was set as having
> rt priority by some webkit coincidence.

Hi,
there it is then. I see those:

>   Jan 31 10:49:22 localhost.localdomain rtkit-daemon[826]:
>   Successfully made thread 4820 of process 4640 (/usr/bin/evolution)
>   owned by '1000' RT at priority 5.

in the journal make more sense now.

Thank you all for the help on this.

Bye,
Milan
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-02-04 Thread Yanko Kaneti
On Sat, 2024-02-03 at 13:29 +0100, Ondrej Mosnáček wrote:
> On Fri, 2 Feb 2024 at 17:52, Yanko Kaneti  wrote:
> > 
> > On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
> > > On Thu, 1 Feb 2024 at 09:13, Milan Crha  wrote:
> > > > The kernel tracing log for sig==9 shows:
> > > > 
> > > > gnome-terminal--2924[002] dN.2.  2520.462889: signal_generate:
> > > > sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
> > > > 
> > > > There is no such thing (apart of the tracing log) when Evolution is
> > > > suddenly killed, the logs are muted. That makes me believe it's not the
> > > > OOM killer whom kills the application. I'm back at square 1; or maybe
> > > > square 2, as I know possibly kernel sends it, but not why.
> > > 
> > > Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as
> > > root) and then collect the kernel trace again. That should give you a
> > > backtrace of kernel functions from the signal generation, which could
> > > help you/us to figure out the reason the process was killed.
> > 
> > So, figured the easiest way to help trigger the kill here is to put load
> > on the machine.
> > 
> >  $ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95
> > 
> > then starting evolution seems to do it almost every time shortly after
> > start (I have around 200k messages in the folder its trying to display)
> > 
> > I've enabled the stacktrace tracing option and like Milan set a sig==9
> > filter. And here is what I got in the trace buffer after it was killed
> > 
> > # tracer: nop
> > #
> > # entries-in-buffer/entries-written: 34/34   #P:16
> > #
> > #_-=> irqs-off/BH-disabled
> > #   / _=> need-resched
> > #  | / _---=> hardirq/softirq
> > #  || / _--=> preempt-depth
> > #  ||| / _-=> migrate-disable
> > #   / delay
> > #   TASK-PID CPU#  |  TIMESTAMP  FUNCTION
> > #  | | |   | | |
> >evolution-9096[002] d..1.  1207.016489: signal_generate: sig=9 
> > errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
> >evolution-9096[002] d..1.  1207.016495: 
> >  => trace_event_raw_event_signal_generate
> >  => __send_signal_locked
> >  => posix_cpu_timers_work
> >  => task_work_run
> >  => irqentry_exit_to_user_mode
> >  => asm_sysvec_apic_timer_interrupt
> 
> So, browsing through the relevant kernel code, it seems the only cases
> which could have led to this backtrace are:
> 1. When a task's RT timeout goes over the RLIMIT_RTTIME hard limit
> (see function check_thread_timers() in
> kernel/time/posix-cpu-timers.c).
> 2. When a task's CPU time goes over the RLIMIT_CPU hard limit (see
> function check_process_timers() in kernel/time/posix-cpu-timers.c).
> 
> I may have missed some code path, but these resource limits should be
> the next thing to check.

Thanks a lot for diving it. This is probably it.
The kernel killing the main evolution process which was set as having rt
priority by some webkit coincidence.

Running with rtkit-daemon stoped and masked prevents the main evo
process being prioritized and consequently killed with higher load.

The fact that the kernel kills the (main)thread without a beep anywhere
in the logs is strange to me.

- Yanko
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-02-03 Thread Ondrej Mosnáček
On Fri, 2 Feb 2024 at 17:52, Yanko Kaneti  wrote:
>
> On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
> > On Thu, 1 Feb 2024 at 09:13, Milan Crha  wrote:
> > > The kernel tracing log for sig==9 shows:
> > >
> > > gnome-terminal--2924[002] dN.2.  2520.462889: signal_generate:
> > > sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
> > >
> > > There is no such thing (apart of the tracing log) when Evolution is
> > > suddenly killed, the logs are muted. That makes me believe it's not the
> > > OOM killer whom kills the application. I'm back at square 1; or maybe
> > > square 2, as I know possibly kernel sends it, but not why.
> >
> > Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as
> > root) and then collect the kernel trace again. That should give you a
> > backtrace of kernel functions from the signal generation, which could
> > help you/us to figure out the reason the process was killed.
>
> So, figured the easiest way to help trigger the kill here is to put load
> on the machine.
>
>  $ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95
>
> then starting evolution seems to do it almost every time shortly after
> start (I have around 200k messages in the folder its trying to display)
>
> I've enabled the stacktrace tracing option and like Milan set a sig==9
> filter. And here is what I got in the trace buffer after it was killed
>
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 34/34   #P:16
> #
> #_-=> irqs-off/BH-disabled
> #   / _=> need-resched
> #  | / _---=> hardirq/softirq
> #  || / _--=> preempt-depth
> #  ||| / _-=> migrate-disable
> #   / delay
> #   TASK-PID CPU#  |  TIMESTAMP  FUNCTION
> #  | | |   | | |
>evolution-9096[002] d..1.  1207.016489: signal_generate: sig=9 
> errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
>evolution-9096[002] d..1.  1207.016495: 
>  => trace_event_raw_event_signal_generate
>  => __send_signal_locked
>  => posix_cpu_timers_work
>  => task_work_run
>  => irqentry_exit_to_user_mode
>  => asm_sysvec_apic_timer_interrupt

So, browsing through the relevant kernel code, it seems the only cases
which could have led to this backtrace are:
1. When a task's RT timeout goes over the RLIMIT_RTTIME hard limit
(see function check_thread_timers() in
kernel/time/posix-cpu-timers.c).
2. When a task's CPU time goes over the RLIMIT_CPU hard limit (see
function check_process_timers() in kernel/time/posix-cpu-timers.c).

I may have missed some code path, but these resource limits should be
the next thing to check.

>evolution-9096[002] d..2.  1207.016564: signal_generate: sig=9 
> errno=0 code=0 comm=bwrap pid=9145 grp=1 res=0
>evolution-9096[002] d..2.  1207.016568: 
>  => trace_event_raw_event_signal_generate
>  => __send_signal_locked
>  => do_send_sig_info
>  => do_exit
>  => do_group_exit
>  => get_signal
>  => arch_do_signal_or_restart
>  => irqentry_exit_to_user_mode
>  => asm_sysvec_apic_timer_interrupt
> ...
> and 32 other events of bwrap cleaning up.
>
> Does that shed any light? Other that is seems to be sending the signal
> to itself.
>
> -Yanko
> --
> ___
> devel mailing list -- devel@lists.fedoraproject.org
> To unsubscribe send an email to devel-le...@lists.fedoraproject.org
> Fedora Code of Conduct: 
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: 
> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
> Do not reply to spam, report it: 
> https://pagure.io/fedora-infrastructure/new_issue
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-02-02 Thread Yanko Kaneti
On Thu, 2024-02-01 at 09:44 +0100, Ondrej Mosnáček wrote:
> On Thu, 1 Feb 2024 at 09:13, Milan Crha  wrote:
> > The kernel tracing log for sig==9 shows:
> > 
> > gnome-terminal--2924[002] dN.2.  2520.462889: signal_generate:
> > sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
> > 
> > There is no such thing (apart of the tracing log) when Evolution is
> > suddenly killed, the logs are muted. That makes me believe it's not the
> > OOM killer whom kills the application. I'm back at square 1; or maybe
> > square 2, as I know possibly kernel sends it, but not why.
> 
> Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as
> root) and then collect the kernel trace again. That should give you a
> backtrace of kernel functions from the signal generation, which could
> help you/us to figure out the reason the process was killed.

So, figured the easiest way to help trigger the kill here is to put load
on the machine. 

 $ stress-ng --cpu -1 --cpu-method all -t 5m --cpu-load 95

then starting evolution seems to do it almost every time shortly after
start (I have around 200k messages in the folder its trying to display)

I've enabled the stacktrace tracing option and like Milan set a sig==9
filter. And here is what I got in the trace buffer after it was killed

# tracer: nop
#
# entries-in-buffer/entries-written: 34/34   #P:16
#
#_-=> irqs-off/BH-disabled
#   / _=> need-resched
#  | / _---=> hardirq/softirq
#  || / _--=> preempt-depth
#  ||| / _-=> migrate-disable
#   / delay
#   TASK-PID CPU#  |  TIMESTAMP  FUNCTION
#  | | |   | | |
   evolution-9096[002] d..1.  1207.016489: signal_generate: sig=9 
errno=0 code=128 comm=evolution pid=9096 grp=1 res=0
   evolution-9096[002] d..1.  1207.016495: 
 => trace_event_raw_event_signal_generate
 => __send_signal_locked
 => posix_cpu_timers_work
 => task_work_run
 => irqentry_exit_to_user_mode
 => asm_sysvec_apic_timer_interrupt
   evolution-9096[002] d..2.  1207.016564: signal_generate: sig=9 
errno=0 code=0 comm=bwrap pid=9145 grp=1 res=0
   evolution-9096[002] d..2.  1207.016568: 
 => trace_event_raw_event_signal_generate
 => __send_signal_locked
 => do_send_sig_info
 => do_exit
 => do_group_exit
 => get_signal
 => arch_do_signal_or_restart
 => irqentry_exit_to_user_mode
 => asm_sysvec_apic_timer_interrupt
...
and 32 other events of bwrap cleaning up.

Does that shed any light? Other that is seems to be sending the signal
to itself.

-Yanko
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-02-01 Thread Ondrej Mosnáček
On Thu, 1 Feb 2024 at 09:13, Milan Crha  wrote:
> The kernel tracing log for sig==9 shows:
>
> gnome-terminal--2924[002] dN.2.  2520.462889: signal_generate:
> sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0
>
> There is no such thing (apart of the tracing log) when Evolution is
> suddenly killed, the logs are muted. That makes me believe it's not the
> OOM killer whom kills the application. I'm back at square 1; or maybe
> square 2, as I know possibly kernel sends it, but not why.

Try running `echo stacktrace >/sys/kernel/tracing/trace_options` (as
root) and then collect the kernel trace again. That should give you a
backtrace of kernel functions from the signal generation, which could
help you/us to figure out the reason the process was killed.
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-02-01 Thread Milan Crha
On Wed, 2024-01-31 at 13:40 -0500, Paul Grosu wrote:
> 2) Or you can completely disable it:

Hi,
I do not want to disable the oom service. Remember, it does that on
user machines, not only on mine. Telling people: "you want to use app
a, b, c, then disable oom" as a new Fedora 40 feature is not great ;)

Nonetheless, I though I'll give it a try and will trigger the oom, to
see what it does. I wrote a little app to allocate gigabytes of memory
(until malloc() returns NULL) and one process was not enough, thus I
ran second and then third and then I've been notified by a gnome-shell
bubble telling me "Virtual Terminal Stopped" with a reason that memory
is almost full. Looking on the terminals the second process I ran had
been killed and the `dmesg` shows a reason:

[ 1712.960372] oom-
kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0
,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/ap
p.slice/app-org.gnome.Terminal.slice/vte-spawn-7f4b1469-38d1-498f-b363-
1c8311d2fcff.scope,task=alloc-too-much,pid=3323,uid=1000

[ 1712.960389] Out of memory: Killed process 3323 (alloc-too-much)
total-vm:137300925912kB, anon-rss:128kB, file-rss:1280kB, shmem-
rss:0kB, UID:1000 pgtables:1544608kB oom_score_adj:200

The journal contains similar information, plus a lot of debugging
information from kernel, starting with:

Feb 01 08:45:47 localhost.localdomain kernel: alloc-too-much invoked
oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP),
order=0,...

The kernel tracing log for sig==9 shows:

gnome-terminal--2924[002] dN.2.  2520.462889: signal_generate:
sig=9 errno=0 code=128 comm=alloc-too-much pid=3502 grp=1 res=0

There is no such thing (apart of the tracing log) when Evolution is
suddenly killed, the logs are muted. That makes me believe it's not the
OOM killer whom kills the application. I'm back at square 1; or maybe
square 2, as I know possibly kernel sends it, but not why.

Bye,
Milan
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Chris Murphy


On Wed, Jan 31, 2024, at 11:40 AM, Paul Grosu wrote:
> 
> 
> https://github.com/facebookincubator/oomd/
> 
> 2) Or you can completely disable it:
> 
> https://www.cjjackson.dev/posts/what-is-systemd-oomd-how-to-disable-it/

We need bug reports to get it fixed rather than disabling it, if it's causing 
things to get clobbered that shouldn't be.

The kernel OOM killer is strictly concerned about kernel survival in 
dramatically low memory situations. It is unconcerned about user space making 
progress. Hence user space oom managers.

I admit this is probably suboptimal in some cases. Full resource control is a 
better way of handling the ambiguity whether a program is incorrectly hogging 
resources versus simply doing its job. Resource control may not obviate a user 
space oom killer but I think what we really care about is giving a program all 
the resources it needs, up to the point that the user wants an interactive 
desktop - and then the kernel needs to (in effect) preempt the resource hungry 
processes in favor of user-desktop interactivity. Resource control can do this 
but we don't have everything wired up yet, more work is needed.


--
Chris Murphy
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Paul Grosu
I think your options are two:

1) Write a plugin for the oom service to capture the state (and log it)
before it kills the process.  Here's the source code to learn more about
that:

https://github.com/facebookincubator/oomd/

2) Or you can completely disable it:

https://www.cjjackson.dev/posts/what-is-systemd-oomd-how-to-disable-it/

The man pages which other folks referred to is here (in case you want to
tweak with configuring it):

https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html

Hope it helps,
~p

On Wed, Jan 31, 2024 at 1:04 PM Michael Catanzaro 
wrote:

> On Wed, Jan 31 2024 at 06:53:25 PM +01:00:00, Milan Crha
>  wrote:
> > Evo itself doesn't use any seccomp or such, these things can be used
> > by
> > the WebKitGTK. A quick grep revealed:
> >
> >
> https://github.com/WebKit/WebKit/blob/main/Source/WebKit/UIProcess/Launcher/glib/ProcessLauncherGLib.cpp#L258
> >
> > but that code does not seem to be called at this time (or my debug
> > code
> > was wrong, it's possible).
>
> That code is for terminating child processes. It's certainly not
> supposed to terminate itself.
>
> --
> ___
> devel mailing list -- devel@lists.fedoraproject.org
> To unsubscribe send an email to devel-le...@lists.fedoraproject.org
> Fedora Code of Conduct:
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives:
> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
> Do not reply to spam, report it:
> https://pagure.io/fedora-infrastructure/new_issue
>
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Michael Catanzaro
On Wed, Jan 31 2024 at 06:53:25 PM +01:00:00, Milan Crha 
 wrote:
Evo itself doesn't use any seccomp or such, these things can be used 
by

the WebKitGTK. A quick grep revealed:

https://github.com/WebKit/WebKit/blob/main/Source/WebKit/UIProcess/Launcher/glib/ProcessLauncherGLib.cpp#L258

but that code does not seem to be called at this time (or my debug 
code

was wrong, it's possible).


That code is for terminating child processes. It's certainly not 
supposed to terminate itself.


--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Milan Crha
On Wed, 2024-01-31 at 16:24 +0100, Petr Pisar wrote:
> A possible explanation is that the signals are procecessed
> asychnously and evolution manages to dispatch the signal to bwrap
> before kernel
> termites evolution because of the first signal. Or I misinterpret the
> log.

Hi,
there had been more lines involving bwrap and WebKitWebProcess, if I
recall correctly (I closed the machine already and did not save the
full log, I'm sorry). The above lines were those directly involving the
evolution process. I did not want to spam the list with a too long log.
Bye,
Milan
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Milan Crha
On Wed, 2024-01-31 at 08:31 -0600, Michael Catanzaro wrote:
> Maybe it could also be sent by mutter if a program 
> is unresponsive?

Hi,
the app is perfectly responsive. I click on a widget and the app is
killed immediately. There is no freeze of the app.

> WebKitGTK doesn't use SIGKILL to clean up after itself.

Evo itself doesn't use any seccomp or such, these things can be used by
the WebKitGTK. A quick grep revealed:

https://github.com/WebKit/WebKit/blob/main/Source/WebKit/UIProcess/Launcher/glib/ProcessLauncherGLib.cpp#L258

but that code does not seem to be called at this time (or my debug code
was wrong, it's possible).


There was an update on the bug #2253099 that also liferea is affected,
thus at least evo is not alone. There is suspect that either glib2 or
kernel update caused it, but who knows.
Bye,
Milan
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Stephen Smoogen
On Wed, 31 Jan 2024 at 12:00, Michael Catanzaro 
wrote:

> On Wed, Jan 31 2024 at 04:42:08 PM +01:00:00, Clemens Lang
>  wrote:
> > Throwing some ideas out there, is it possible that evolution runs
> > with a seccomp filter or other BPF program configured to kill the
> > process on violation, and that’s what’s happening here?
>
> I don't think so. flatpak does use seccomp filters [1], but on
> violations the syscalls should just fail with EPERM or ENOSYS, not kill
> the process.
>
>
Could the problem with evolution be really with something else and getting
masked out ? I saw there was another thread about bogofilter killing
evolution due to the database needing to change from berkeley to sqllite by
running a helper. If the helper is being run but it runs out of memory,
would that cause the whole chain of OOM ?




> [1]
>
> https://github.com/flatpak/flatpak/blob/d5f891e0035e50b24211688a7fa5f61924a2e51c/common/flatpak-run.c#L1791
>
> --
> ___
> devel mailing list -- devel@lists.fedoraproject.org
> To unsubscribe send an email to devel-le...@lists.fedoraproject.org
> Fedora Code of Conduct:
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives:
> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
> Do not reply to spam, report it:
> https://pagure.io/fedora-infrastructure/new_issue
>


-- 
Stephen Smoogen, Red Hat Automotive
Let us be kind to one another, for most of us are fighting a hard battle.
-- Ian MacClaren
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Michael Catanzaro
On Wed, Jan 31 2024 at 04:42:08 PM +01:00:00, Clemens Lang 
 wrote:
Throwing some ideas out there, is it possible that evolution runs 
with a seccomp filter or other BPF program configured to kill the 
process on violation, and that’s what’s happening here?


I don't think so. flatpak does use seccomp filters [1], but on 
violations the syscalls should just fail with EPERM or ENOSYS, not kill 
the process.


[1] 
https://github.com/flatpak/flatpak/blob/d5f891e0035e50b24211688a7fa5f61924a2e51c/common/flatpak-run.c#L1791


--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Clemens Lang
Hey,


> On 31. Jan 2024, at 16:24, Petr Pisar  wrote:
> 
> Key information is code=128. That code is probably si_code member described in
> sigaction(2). The manual lists a lot of values as constants. 128 value is
> SI_KERNEL according to /usr/include/asm-generic/siginfo.h. It is documented
> as "sent by the kernel from somewhere". It means the first signal does not
> come a user space. It's genuinly generated by kernel. E.g. when the process is
> killed for out-of-memory reason. However, in this case, I would expect kernel
> to log this event into dmesg. E.g. out-of-memory killer is very verbose in
> dmesg.
> 
> Strange thing is the second line. It reads that process 3211 (code=0 meads
> SI_USER, "sent by kill, sigsend, raise") sent SIGKILL to 3257. It's
> questionable how a process 3211 killed with the first signal can still emit
> signals. A possible explanation is that the signals are procecessed
> asychnously and evolution manages to dispatch the signal to bwrap before 
> kernel
> termites evolution because of the first signal. Or I misinterpret the
> log.


Throwing some ideas out there, is it possible that evolution runs with a 
seccomp filter or other BPF program configured to kill the process on 
violation, and that’s what’s happening here?

For software that regularly deals with untrusted input, it doesn’t seem 
unreasonable that the developers might have implemented something like that, 
and a seccomp kill might actually look like that.


-- 
Clemens Lang
RHEL Crypto Team
Red Hat


--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Petr Pisar
V Wed, Jan 31, 2024 at 01:07:42PM +0100, Milan Crha napsal(a):
>   evolution-3211 [003] d..1.   355.904404: signal_generate: sig=9 errno=0 
> code=128 comm=evolution pid=3211 grp=1 res=0
>   evolution-3211 [003] d..2.   355.904450: signal_generate: sig=9 errno=0 
> code=0 comm=bwrap pid=3257 grp=1 res=0
> 
Key information is code=128. That code is probably si_code member described in
sigaction(2). The manual lists a lot of values as constants. 128 value is
SI_KERNEL according to /usr/include/asm-generic/siginfo.h. It is documented
as "sent by the kernel from somewhere". It means the first signal does not
come a user space. It's genuinly generated by kernel. E.g. when the process is
killed for out-of-memory reason. However, in this case, I would expect kernel
to log this event into dmesg. E.g. out-of-memory killer is very verbose in
dmesg.

Strange thing is the second line. It reads that process 3211 (code=0 meads
SI_USER, "sent by kill, sigsend, raise") sent SIGKILL to 3257. It's
questionable how a process 3211 killed with the first signal can still emit
signals. A possible explanation is that the signals are procecessed
asychnously and evolution manages to dispatch the signal to bwrap before kernel
termites evolution because of the first signal. Or I misinterpret the
log.

-- Petr


signature.asc
Description: PGP signature
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Michael Catanzaro
SIGKILL is almost always sent by systemd-oomd (or the kernel OOM 
killer). That's the most likely explanation. Theoretically it could 
also be sent by systemd if a service didn't quit quickly enough 
following a SIGTERM. Maybe it could also be sent by mutter if a program 
is unresponsive?


WebKitGTK doesn't use SIGKILL to clean up after itself. The parent 
process just closes its IPC socket to the child process, then the child 
should either (a) exit normally, or (b) crash itself, if its watchdog 
thread detects the main thread has hung.


On Wed, Jan 31 2024 at 11:08:16 AM +01:00:00, Milan Crha 
 wrote:

Jan 31 10:49:23 localhost.localdomain xdg-desktop-por[2697]: Realtime
error: Could not map pid: Could not determine pid namespace: Could not
find instance-id in process's /.flatpak-info


This one is https://bugs.webkit.org/show_bug.cgi?id=238403 (fixed 
recently)


--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Milan Crha
On Wed, 2024-01-31 at 13:07 +0100, Milan Crha wrote:
> In such case, I might be able to catch this in gdb, right? Maybe with
> a breakpoint in the `kill` function, and any other?

Okay, I tried with the following (more variants, just in case):

   gdb evolution \
--ex "b kill" \
--ex "b exit" \
--ex "b __kill" \
--ex "b _kill" \
--ex "b _exit" \
--ex "b pidfd_send_signal" \
--ex "b tkill if sig==9" \
--ex r

and it did not catch any of these. The last lines are:

   [Thread 0x7fffd92006c0 (LWP 3086) exited]
   [New process 3080]

   Program terminated with signal SIGKILL, Killed.
   The program no longer exists.
   (gdb)

which may or may not mean it is sent by a different function, I guess.
I cannot tell how much time was between the kill signal and the "[New
process 3080]" line.

Bye,
Milan
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Milan Crha
On Wed, 2024-01-31 at 12:19 +0100, Petr Pisar wrote:
> This procedure works for me
> .
> The tracefs file system has a nice log.

Hi,
that works pretty well, thank you. Funny enough, if I read it
correctly, it says evo killed itself (the first three bwrap-s are
probably unrelated, it could be when I opened and closed a composer,
which let WebKitGTK create a new WebKitWebProcess; the last line might
be when WebKitGTK cleans up after itself, or all the bwraps are at last
created by the WebKitGTK):

  bwrap-3756 [002] d..2.   341.677197: signal_generate: sig=9 errno=0 
code=0 comm=bwrap pid=3757 grp=1 res=1
  bwrap-3753 [000] d..2.   341.677296: signal_generate: sig=9 errno=0 
code=0 comm=bwrap pid=3754 grp=1 res=1
  bwrap-3856 [002] d..2.   345.231590: signal_generate: sig=9 errno=0 
code=0 comm=bwrap pid=3857 grp=1 res=1
  evolution-3211 [003] d..1.   355.904404: signal_generate: sig=9 errno=0 
code=128 comm=evolution pid=3211 grp=1 res=0
  evolution-3211 [003] d..2.   355.904450: signal_generate: sig=9 errno=0 
code=0 comm=bwrap pid=3257 grp=1 res=0

In such case, I might be able to catch this in gdb, right? Maybe with a
breakpoint in the `kill` function, and any other? These things are very
low-level for me, I'm sorry.

Bye,
Milan
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Daniel P . Berrangé
On Wed, Jan 31, 2024 at 11:06:02AM +, Tom Hughes via devel wrote:
> On 31/01/2024 10:08, Milan Crha wrote:
> 
> > I tried to investigate a rawhide bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=2253099
> > which is about Evolution being killed "by something". That's the thing,
> > I do not know what killed it, thus even why it had been killed. It's
> > even not killed after certain steps, it's killed "randomly", on various
> > occasions.
> > 
> > I did search the internet, but they usually expect the killer is the
> > OOM service, which logs about the action either in the dmesg or in the
> > journal, but in this case there is no sign about whom killed it in
> > either of these logs.
> > 
> > The evolution terminal just says:
> > 
> > Killed
> 
> At the end of of the day it means a SIGKILL was sent to the process
> and that's not something that is logged anywhere as a matter of course
> so you're reliant on whatever sends it saying so.
> 
> You're right that OOM is the usual cause so if you've ruled that out
> you need to think about other things.
> 
> The problem is that SIGKILL is deliberately a very hard stop that
> nothing can trap so normal things like using strace or gdb to catch
> who went it aren't going to work.

The audit subsystem is probably the first choice to find out what's
killing it. Other than that, systemtap or eBPF scripts can be written
to trace this.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Petr Pisar
V Wed, Jan 31, 2024 at 11:08:16AM +0100, Milan Crha napsal(a):
>   Hi,
> I tried to investigate a rawhide bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=2253099
> which is about Evolution being killed "by something". That's the thing,
> I do not know what killed it, thus even why it had been killed. It's
> even not killed after certain steps, it's killed "randomly", on various
> occasions.
> 
> I did search the internet, but they usually expect the killer is the
> OOM service, which logs about the action either in the dmesg or in the
> journal, but in this case there is no sign about whom killed it in
> either of these logs.
>

This procedure works for me
. The
tracefs file system has a nice log. An example of /usr/bin/sleep receiving
SIGKILL from bash:

root@fedora-40:/sys/kernel/debug/tracing # cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:4
#
#_-=> irqs-off/BH-disabled
#   / _=> need-resched
#  | / _---=> hardirq/softirq
#  || / _--=> preempt-depth
#  ||| / _-=> migrate-disable
#   / delay
#   TASK-PID CPU#  |  TIMESTAMP  FUNCTION
#  | | |   | | |
bash-4867[002] d..1. 10571.864270: signal_generate: sig=9 
errno=0 code=0 comm=sleep pid=4866 grp=1 res=0

-- Petr


signature.asc
Description: PGP signature
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Re: Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Tom Hughes via devel

On 31/01/2024 10:08, Milan Crha wrote:


I tried to investigate a rawhide bug:
https://bugzilla.redhat.com/show_bug.cgi?id=2253099
which is about Evolution being killed "by something". That's the thing,
I do not know what killed it, thus even why it had been killed. It's
even not killed after certain steps, it's killed "randomly", on various
occasions.

I did search the internet, but they usually expect the killer is the
OOM service, which logs about the action either in the dmesg or in the
journal, but in this case there is no sign about whom killed it in
either of these logs.

The evolution terminal just says:

Killed


At the end of of the day it means a SIGKILL was sent to the process
and that's not something that is logged anywhere as a matter of course
so you're reliant on whatever sends it saying so.

You're right that OOM is the usual cause so if you've ruled that out
you need to think about other things.

The problem is that SIGKILL is deliberately a very hard stop that
nothing can trap so normal things like using strace or gdb to catch
who went it aren't going to work.

Tom

--
Tom Hughes (t...@compton.nu)
http://compton.nu/
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


Figure out what killed an app (rhbz#2253099)

2024-01-31 Thread Milan Crha
Hi,
I tried to investigate a rawhide bug:
https://bugzilla.redhat.com/show_bug.cgi?id=2253099
which is about Evolution being killed "by something". That's the thing,
I do not know what killed it, thus even why it had been killed. It's
even not killed after certain steps, it's killed "randomly", on various
occasions.

I did search the internet, but they usually expect the killer is the
OOM service, which logs about the action either in the dmesg or in the
journal, but in this case there is no sign about whom killed it in
either of these logs.

The evolution terminal just says:

   Killed

and the `dmesg | grep -i kill` only shows:

   [3.435200] systemd[1]: Listening on systemd-oomd.socket - Userspace
  Out-Of-Memory (OOM) Killer Socket.
   [6.051242] rfkill: input handler disabled
   [   22.579276] rfkill: input handler enabled
   [   23.539350] rfkill: input handler disabled

and `journalctl -xb | grep -i kill` doesn't reveal anything useful. The
`journalctl -xb | grep -i evolution` has as its last line:

   Jan 31 10:49:22 localhost.localdomain rtkit-daemon[826]:
   Successfully made thread 4820 of process 4640 (/usr/bin/evolution)
   owned by '1000' RT at priority 5.

which does not explain anything. Grepping for the 4640 (evo's PID)
didn't show anything either.

That being said, is there a way to figure out what killed the app,
please?

Thanks and bye,
Milan

P.S.: the above-pasted journalctl line is followed by these three,
which look suspicious, but maybe they are unrelated. I even do not know
whether they had been added immediately after the app was killed or
before it. At least the terminal should not matter, because evo is
killed when being started from the GUI too. Three log lines follow:

Jan 31 10:49:23 localhost.localdomain xdg-desktop-por[2697]: Realtime
error: Could not map pid: Could not determine pid namespace: Could not
find instance-id in process's /.flatpak-info

Jan 31 10:50:46 localhost.localdomain gnome-terminal-[2939]: void
terminal_screen_shell_preexec(VteTerminal*): assertion
 '!priv->between_preexec_and_precmd' failed

Jan 31 10:50:49 localhost.localdomain gnome-terminal-[2939]: void
terminal_screen_shell_preexec(VteTerminal*): assertion
 '!priv->between_preexec_and_precmd' failed
--
___
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue