I think you should look at git blame on the relevant lines in the kernel and ask for help from those people and/or loop them onto this thread.

On Mar 19, 2015 8:53 AM, "Pavel Labath" <lab...@google.com> wrote:
> After a considerable amount of effort, I have managed to get a log trace of a rare failure in TestConcurrentEvents (test_two_watchpoints_one_signal_dwarf). Unfortunately, I do not see a clear way to avoid it, so I'm writing here in case anyone has an idea.
>
> Here is my interpretation of the events, backed by log snippets (the full log is in the attachment; I can also provide the corresponding core file if anyone wants to dig around).
>
> The inferior starts out by creating three threads. Of those, two are set up to hit a watchpoint and one signals itself.
>
> 411 1426700114.595431000 [7295/729c]: MonitorChildProcessThreadFunction ::waitpid (pid = -29339, &status, options = 1073741824) => pid = 29346, status = 0x00000a7f (STOPPED), signal = 10, exit_state = 0
>
> On line 411 we see llgs catching the SIGUSR1 (10) signal.
>
> 412 1426700114.595450000 [7295/729c]: NativeProcessLinux::MonitorSignal() received signal SIGUSR1 (10) with code SI_TKILL, (siginfo pid = 29339 (not from llgs), waitpid pid = 29346)
>
> Here we see that the ptrace(GETSIGINFO) call was successful: we get info about the signal and continue on our way, stop all the other threads (which, by the way, have hit a watchpoint), report this to lldb, etc. After some back and forth, lldb tells us to resume the inferior and let it process the SIGUSR1 signal.
>
> 548 1426700116.644772000 [7295/729a]: NativeProcessLinux::Resume() resuming thread = 29346 with signal SIGUSR1
>
> However, here things get strange. Immediately after that we get a second waitpid notification:
>
> 555 1426700116.644896000 [7295/729c]: MonitorChildProcessThreadFunction ::waitpid (pid = -29339, &status, options = 1073741824) => pid = 29346, status = 0x00000a7f (STOPPED), signal = 10, exit_state = 0
>
> This shouldn't happen, as we have just reinjected the signal for the process to handle, and no other signal was sent to the thread in the meantime. But we don't know that, so we try to process the signal as usual, by calling ptrace(GETSIGINFO).
>
> 558 1426700116.644932000 [7295/729c]: ThreadStateCoordinator::EnqueueEvent enqueued event: EventThreadDeath (tid=29346)
> 561 1426700116.644961000 [7295/729c]: NativeProcessLinux::MonitorCallback GetSignalInfo failed: No such process, tid = 29346, signal = 10, status = 0 (thread/process killed, is not main thread, thread metadata removed)
>
> These lines tell us that we are now in the branch where ptrace(GETSIGINFO) failed with ESRCH. This can happen for two reasons:
> - the thread in question has died without us noticing
> - the thread is running
> Since we have just received a waitpid notification, we know that the thread should not be running, so we assume that the thread has died between the waitpid and getsiginfo calls. We clean up our internal structures and forget about the thread.
>
> However, the thread is very much alive and kicking, and we get another notification from it on line 567. Now we try to find the thread in our data structures and assert because it does not exist.
>
> These events would seem to indicate that the SIGUSR1 notification on line 555 was bogus - there should be no additional signal, and even if there was, it should have stopped the thread, but it didn't do that.
>
> So it seems that we should just ignore bogus signals (i.e., assume they don't stop the thread and they don't kill it).
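[For readers following along: the monitor thread's flow described above is essentially a blocking waitpid on the inferior's process group followed by ptrace(GETSIGINFO) on the thread that reported a stop. The sketch below shows just that shape; it is not the actual NativeProcessLinux code, and the names monitor_once/pgid are made up for illustration.]

  /* Rough, simplified sketch of the monitor-thread idiom (hypothetical). */
  #include <errno.h>
  #include <signal.h>
  #include <stdio.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  #ifndef __WALL
  #define __WALL 0x40000000  /* the options = 1073741824 seen in the log */
  #endif

  static void monitor_once(pid_t pgid)
  {
      int status = 0;
      /* Wait for a state change in any thread of the inferior's process
         group, clone threads included. */
      pid_t tid = waitpid(-pgid, &status, __WALL);
      if (tid <= 0 || !WIFSTOPPED(status))
          return;  /* exit/termination handling omitted */

      siginfo_t info;
      if (ptrace(PTRACE_GETSIGINFO, tid, NULL, &info) == 0) {
          /* Normal case: the thread really is stopped on a signal.
             Report it, stop the other threads, wait for a resume. */
          printf("tid %d stopped with signal %d\n", (int)tid, WSTOPSIG(status));
      } else if (errno == ESRCH) {
          /* Ambiguous case: either the thread died between waitpid() and
             ptrace(), or it is not actually stopped at all (the "bogus"
             notification above).  The current code assumes death and drops
             the thread's metadata, which is what leads to the assert when
             the thread reports in again. */
          printf("tid %d: GETSIGINFO failed with ESRCH\n", (int)tid);
      }
  }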
> However, it is not easy to differentiate between the three scenarios:
> - real signal - waitpid returns, getsiginfo succeeds - thread is stopped
> - bogus signal - waitpid returns, getsiginfo fails - thread is running
> - thread death - waitpid returns, getsiginfo fails - thread is dead
>
> Does anyone have an idea how to address this? I have a few, but I don't like any of them very much, so I'd rather keep them to myself for now. Besides, this mail has gotten too long already. If you're still reading this, give yourself a cookie. :)
>
> pl
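For anyone else digging into this, the ambiguity in the last two rows of that list is easy to reproduce outside of lldb. Below is a small standalone demo (hypothetical code, not from the lldb tree) suggesting that PTRACE_GETSIGINFO succeeds only while the tracee is actually in a ptrace-stop, and fails with ESRCH both when the tracee is running and when it no longer exists:

  #include <errno.h>
  #include <signal.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static void query(const char *state, pid_t tid)
  {
      siginfo_t si;
      long r = ptrace(PTRACE_GETSIGINFO, tid, NULL, &si);
      printf("%-7s: GETSIGINFO -> %s\n", state,
             r == 0 ? "ok" : strerror(errno));
  }

  int main(void)
  {
      pid_t child = fork();
      if (child == -1) { perror("fork"); return 1; }
      if (child == 0)
          for (;;) pause();            /* inferior: just stay alive */

      /* Attach and wait for the initial stop. */
      if (ptrace(PTRACE_ATTACH, child, NULL, NULL) == -1) {
          perror("PTRACE_ATTACH");
          return 1;
      }
      waitpid(child, NULL, 0);
      query("stopped", child);         /* tracee is in a ptrace-stop */

      ptrace(PTRACE_CONT, child, NULL, NULL);
      query("running", child);         /* tracee resumed, not stopped */

      kill(child, SIGKILL);            /* tracee killed and reaped */
      waitpid(child, NULL, 0);
      query("dead", child);

      return 0;
  }

Built with a plain cc and run as an unprivileged user (it only traces its own child), this should print "ok" for the stopped case and "No such process" for the other two, i.e. from the tracer's side the running and dead cases look identical after a single failed getsiginfo call, which is exactly the hole Pavel describes.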