[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code when vgdb is attached

2022-10-17 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

Libor Peltan  changed:

   What|Removed |Added

 Status|RESOLVED|VERIFIED

--- Comment #28 from Libor Peltan  ---
I confirm that the issue is fixed in Valgrind built from current Git master
branch.

Thank you very much for your cooperation and patience.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code when vgdb is attached

2022-10-15 Thread Philippe Waroquiers
https://bugs.kde.org/show_bug.cgi?id=458915

Philippe Waroquiers  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|REPORTED|RESOLVED

--- Comment #27 from Philippe Waroquiers  ---
(In reply to David Vasek from comment #26)
> I work on this issue together with Libor. If you checkout this branch:
> https://gitlab.nic.cz/knot/knot-dns/-/commits/valgrind_vgdb_bug , it
> includes a few changes that should help you with the debugging.
Thanks for the above.
With this, I was able to identify the likely culprit for the bug.
I have pushed a fix for this as 348775f34
The valgrind regression tests were run on debian/amd64, centos/ppc64 and
ubunty/arm64.
Also, the ctl/valgrind knot test has run 80 times without encountering the
abort
(while before the fix, each run of ctl/valgrind was triggering the bug).

It would be nice if you could check with the latest state of the git repository
if everything works
as expected now.

Thanks for the clear and detailed instructions to reproduce the bug, which was
critically needed to
reproduce this race condition.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code when vgdb is attached

2022-10-13 Thread David Vasek
https://bugs.kde.org/show_bug.cgi?id=458915

David Vasek  changed:

   What|Removed |Added

 CC||david.va...@nic.cz

--- Comment #26 from David Vasek  ---
(In reply to Philippe Waroquiers from comment #25)

> Is there an easy change to the test framework that would ensure all the
> output is either in stderr or valgrind file ?

Hello Philippe,
I work on this issue together with Libor. If you checkout this branch:
https://gitlab.nic.cz/knot/knot-dns/-/commits/valgrind_vgdb_bug , it includes a
few changes that should help you with the debugging.

1. All output from valgrind is written to knot1/stderr (the split output you
saw before was apparently made by valgrind when option `--log-file` was used).
2. The options for valgrind you have suggested are already set in the
framework. If you need to change them, it's obvious from the last commit where
they are.
3. There is a new "test" `ctl/valgrind` that is simple and triggers the bug
very fast. You can run it as `./runtests.py ctl/valgrind` from
knot-dns/tests-extra, same as the test before.
4. When knotd encounters the bug, it calls abort now. Then it's easy to find
the place in the valgrind output (knot1/stderr) by searching for 'abort'. If
you don't need it, you remove the two asserts added in the last commit.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code when vgdb is attached

2022-10-10 Thread Philippe Waroquiers
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #25 from Philippe Waroquiers  ---
(In reply to Libor Peltan from comment #24)
> You will probably need to run the test several time until it reproduces. It
> may also happen on some machines that it does never reproduce. For me, it
> easily reproduces on my laptop, but hardly on a powerful server.
Thanks for the detailed instructions. I was able to reproduce the bug, happens
more or less 1 on 5 trials.
I will further work on this time permitting (valgrind is a week-end activity,
so only a few hours from time to time).

One question: the test outputs valgrind messages in the file 'valgrind' and
outputs the valgrind debug output in stderr.
This setup is very special, I would like to have both outputs in the same file.
I have not seen where is the piece of code that really launches valgrind and
makes these 2 separate files
(I even do not too much understand how this is done, as valgrind messages and
debug output are normally both written on
the same fd).

Is there an easy change to the test framework that would ensure all the output
is either in stderr or valgrind file ?

Thanks
Philippe

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code when vgdb is attached

2022-10-04 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #24 from Libor Peltan  ---
Hi Philippe,
if you have a mood to try out with Knot DNS, these are the instruction to
reproduce:
1) download Knot DNS sources from https://gitlab.nic.cz/knot/knot-dns . It's OK
to stay at the `master` branch.
2) install building prerequisites. They're listed in README.
3) build it `autoreconf- if && ./configure && make`. It's not necessary to
`make install`.
4) go to `test-extra` subdirectory and install testing prerequisites by `pip3
install -r requirements.txt`
5) in the file `tools/dnstest/params.py` edit the variable `valgrind_flags` to
your liking. At least add `--trace-syscalls=yes`.
6) run `./runtests.py ixfr/redundant_ctl_deny`. Find the valgrind output in
`~/knottest-last/ixfr/redundant_ctl_deny/knot1/valgrind` (where `knottest-last`
is always a symlink somehwere into `/tmp`). Grep for `0xe8` to see if the issue
reproduced.
You will probably need to run the test several time until it reproduces. It may
also happen on some machines that it does never reproduce. For me, it easily
reproduces on my laptop, but hardly on a powerful server.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code when vgdb is attached

2022-10-01 Thread Philippe Waroquiers
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #23 from Philippe Waroquiers  ---
(In reply to Libor Peltan from comment #22)
> (In reply to Philippe Waroquiers from comment #21)
> > Valgrind should stop by itself when it finds an error (when using
> > --vgdb-error argument)
> 
> The error mentioned by me is an error in application logic. Valgrind has no
> way to detect it, no reason to stop in that case and it does not.
> 
> > If yes, as a bypass for this bug, you might try to have valgrind invoking
> > gdbserver and then launch gdb/vgdb, rather than having gdb/vgdb
> > 'interrupting' a (still) running process.
> 
> Yes, we can use a workaround that we don't invoke vgdb at all in such cases.
> But in any case, it would be nice to fix what's going wrong in Valgrind.
> 
> > But in any case, what you are doing should not cause a problem. When I have
> > a little bit of time, I will dig again in the vgdb logic 
> > and see if/where it could create such wrong interaction.
> 
> Thank you for continuous effor in this issue! I tried to create a tiny small
> program that does only listens to incomming UDP packets while using poll or
> epoll_wait syscalls, and frequently attach vgdb to it. Unfortunately, the
> issue did not reproduce in this simple scenario. 
You might need to have a mixture of threads executing some cpu instructions,
some threads
blocked in syscalls, and some signals such as SIGALRM.

> The only reproducer we have
> so far (at least pretty reliable) is the whole Knot DNS and its test-case.
> If you wish, we can give you instructions how to build it and launch the
> test-case, but it'll be several steps to do.
If the instructions are not too long to describe, it will not harm to have
them.
If simple enough and building Knot DNS does not need too much dependencies etc
...,
I can give it a try.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code when vgdb is attached

2022-09-30 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #22 from Libor Peltan  ---
(In reply to Philippe Waroquiers from comment #21)
> Valgrind should stop by itself when it finds an error (when using
> --vgdb-error argument)

The error mentioned by me is an error in application logic. Valgrind has no way
to detect it, no reason to stop in that case and it does not.

> If yes, as a bypass for this bug, you might try to have valgrind invoking
> gdbserver and then launch gdb/vgdb, rather than having gdb/vgdb
> 'interrupting' a (still) running process.

Yes, we can use a workaround that we don't invoke vgdb at all in such cases.
But in any case, it would be nice to fix what's going wrong in Valgrind.

> But in any case, what you are doing should not cause a problem. When I have
> a little bit of time, I will dig again in the vgdb logic 
> and see if/where it could create such wrong interaction.

Thank you for continuous effor in this issue! I tried to create a tiny small
program that does only listens to incomming UDP packets while using poll or
epoll_wait syscalls, and frequently attach vgdb to it. Unfortunately, the issue
did not reproduce in this simple scenario. The only reproducer we have so far
(at least pretty reliable) is the whole Knot DNS and its test-case. If you
wish, we can give you instructions how to build it and launch the test-case,
but it'll be several steps to do.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code when vgdb is attached

2022-09-30 Thread Philippe Waroquiers
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #21 from Philippe Waroquiers  ---
(In reply to Libor Peltan from comment #20)
> Thank you for your observations! Based on this, we actually found out that
> the issue happens exactly (sometimes!) when we attach vgdb to the running
> process, like this:
> 
> ```
> /usr/bin/gdb -ex "set confirm off" -ex "target remote | /usr/bin/vgdb
> --pid=5944" -ex "info threads" -ex "thread apply all bt full" -ex q
> /home/peltan/master_knot/src/knotd
> ```
> 
> I apologize that we overlooked this improtant fact earlier. (Our test
> environment performs this automatically when a routine error occurs.)
> 
> We will continue working on minimizing the reproducer in following days.

That starts to clarify where the problem could originate from.
Valgrind should stop by itself when it finds an error (when using --vgdb-error
argument)
and invoke its gdbserver, waiting for gdb/vgdb to connect.

Do you mean that the above command is launched (somewhat asynchronously) when
an error is detected via other ways ?

If yes, as a bypass for this bug, you might try to have valgrind invoking
gdbserver and then launch gdb/vgdb, rather than having gdb/vgdb
'interrupting' a (still) running process.

But in any case, what you are doing should not cause a problem. When I have a
little bit of time, I will dig again in the vgdb logic 
and see if/where it could create such wrong interaction.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code when vgdb is attached

2022-09-27 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

Libor Peltan  changed:

   What|Removed |Added

Summary|syscall sometimes returns   |syscall sometimes returns
   |its number instead of   |its number instead of
   |return code |return code when vgdb is
   ||attached

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-27 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #20 from Libor Peltan  ---
Thank you for your observations! Based on this, we actually found out that the
issue happens exactly (sometimes!) when we attach vgdb to the running process,
like this:

```
/usr/bin/gdb -ex "set confirm off" -ex "target remote | /usr/bin/vgdb
--pid=5944" -ex "info threads" -ex "thread apply all bt full" -ex q
/home/peltan/master_knot/src/knotd
```

I apologize that we overlooked this improtant fact earlier. (Our test
environment performs this automatically when a routine error occurs.)

We will continue working on minimizing the reproducer in following days.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-26 Thread Philippe Waroquiers
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #19 from Philippe Waroquiers  ---
I took a look at the attached logs.

A first observation:
* We have 2 groups of 3 threads that get the 0xe8 syscall return.
* For each of these 2 groups, we see a little bit before these 0xe8 return that
there is a connection to
   the embedded gdbserver of Valgrind.
Here are the line number and occurrences of the 0xe8 syscall return:
6 matches for "0xe8" in buffer: valgrind
  26734:SYSCALL[61639,20](232) ... [async] --> Success(0xe8) 
  26774:SYSCALL[61639,17](232) ... [async] --> Success(0xe8) 
  26789:SYSCALL[61639,15](232) ... [async] --> Success(0xe8) 
  31141:SYSCALL[61639,14](232) ... [async] --> Success(0xe8) 
  31176:SYSCALL[61639,16](232) ... [async] --> Success(0xe8) 
  31206:SYSCALL[61639,13](232) ... [async] --> Success(0xe8) 

And here are the 3 matches for the gdbserver:
3 matches for "TO DEBUG" in buffer: valgrind
286:==61639== TO DEBUG THIS PROCESS USING GDB: start GDB like this
  26668:==61639== TO DEBUG THIS PROCESS USING GDB: start GDB like this
  31121:==61639== TO DEBUG THIS PROCESS USING GDB: start GDB like this

where the first one is the message produced at startup.
Maybe this is a modified executable that triggers a call to vgdb/gdb  when it
encounters this syscall problem ?
Or is there something that attaches to the valgrind gdbserver or sends a
command to it ?
Because in this last case, we could possibly have an interaction between vgdb
and many threads blocked  in syscalls.
We see in the stderr trace the following:
--61639:2:  gdbsrv   stored register 0 size 8 name rax value 0007
tid 1 status VgTs_WaitSys
--61639:2:  gdbsrv   stored register 0 size 8 name rax value 00e8
tid 15 status VgTs_WaitSys
--61639:2:  gdbsrv   stored register 0 size 8 name rax value 00e8
tid 17 status VgTs_WaitSys
--61639:2:  gdbsrv   stored register 0 size 8 name rax value 00e8
tid 20 status VgTs_WaitSys

--61639:1:  gdbsrv stop_pc 0x4CAC04E changed to be resume_pc 0x4C9CD7F: poll
(poll.c:29)
--61639:2:  gdbsrv   stored register 0 size 8 name rax value 0007
tid 1 status VgTs_WaitSys
--61639:2:  gdbsrv   stored register 0 size 8 name rax value 00e8
tid 13 status VgTs_WaitSys
--61639:2:  gdbsrv   stored register 0 size 8 name rax value 00e8
tid 14 status VgTs_WaitSys
--61639:2:  gdbsrv   stored register 0 size 8 name rax value 00e8
tid 16 status VgTs_WaitSys
--61639:1:  gdbsrv VG core calling VG_(gdbserver_report_signal) vki_nr 15
SIGTERM gdb_nr 15 SIGTERM tid 1
-
So, for the 2 groups of 3 threads that got 0xe8 syscall return, we see that the
valgrind gdbserver was instructed to put 0xe8 
in the rax register.

It is however difficult to relate the stderr output with the valgrind output.
If you could redo the trace with the none tool, but keep together the stderr
and the valgrind output 
(i.e. let valgrind do its output to stderr together with its debug log) + add
--time-stamp=yes
that might help to see what happens in which order.


I have to say that at this state, I have not much idea what other things to
look at.
To further investigate and possibly find the bug,  likely we will need an (easy
to run) reproducer.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-26 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #18 from Libor Peltan  ---
Created attachment 152437
  --> https://bugs.kde.org/attachment.cgi?id=152437=edit
Logs of wrong syscall retvals with no signals observed.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-26 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #17 from Libor Peltan  ---
Thank you much for looking at this issue constantly!

I confirm that with --tool=none, the issue reproduces as well. This is also
true for --tool=helgrind, as I said earlier.

Even with --vex-iropt-register-updates=allregs-at-each-insn , the issue still
reproduces with the same frequency.

I took much closer look at the signals in our application. SIGALRM can be
sometimes sent by the program to itself in order to notify all the threads
about something. However, the issue reproduces even in cases when neither
SIGALRM nor any other signal arrives around the time the issue happens. So it
must be caused (also) by something else.

I'm going to attach the logs from two more different test-cases. In one case,
both epoll and futex return the invalid values (0xe8 and 0xca, respectively),
which results in that assert in glibc and following SIGABRT, which is however
the consequence, not the cause of the issue. In the second case, only epoll
returns wrong 0xe8 value and the program continues running to the end, when
only at the very end SIGTERM is received.

The valgrind params were "--tool=none -v -v -v -d -d -d
--vex-iropt-register-updates=allregs-at-each-insn --trace-syscalls=yes
--trace-signals=yes --trace-sched=yes --num-callers=20 --trace-children=yes
--trace-children-skip=/usr/*sh,/bin/*sh" in both cases.

It seems to me that different threads seem to suffer the same in one moment.
Sometimes the issue happens only once for one thread, sometimes multiple
threads receive the wrong syscall retcodes in close succession. But mostly
without any signal observed, if I'm not overloking something terribly.

Finally, I confirm that even with
--vex-iropt-register-updates=allregs-at-mem-access , the issue still
reproduces.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-24 Thread Philippe Waroquiers
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #16 from Philippe Waroquiers  ---
In one of the trace I see the below trace. It looks like the a signal SIGALRM
is delivered to the thread that encounters the futex 202 result.
--24048-- async signal handler: signal=14, vgtid=24051, tid=4, si_code=-6,
exitreason VgSrc_None
--24048-- interrupted_syscall: tid=4, ip=0x580e687e, restart=False,
sres.isErr=False, sres.val=202
--24048--   at syscall instr: returning EINTR
--24048-- delivering signal 14 (SIGALRM):-6 to thread 4
--24048-- push_signal_frame (thread 4): signal 14
==24048==at 0x4C2C340: futex_wait (futex-internal.h:146)
==24048==by 0x4C2C340: __lll_lock_wait (lowlevellock.c:49)
==24048==by 0x4C32322: __pthread_mutex_cond_lock (pthread_mutex_lock.c:93)
==24048==by 0x4C2E9B3: __pthread_cond_wait_common (pthread_cond_wait.c:616)
==24048==by 0x4C2E9B3: pthread_cond_wait@@GLIBC_2.3.2
(pthread_cond_wait.c:627)
==24048==by 0x14184D: worker_main (pool.c:70)
==24048==by 0x1395B2: thread_ep (dthreads.c:146)
==24048==by 0x4C2FB42: start_thread (pthread_create.c:442)
==24048==by 0x4CC0BB3: clone (clone.S:100)
So, this is another indication that the problem is likely linked to 
VG_(fixup_guest_state_after_syscall_interrupted).
But it is not very clear what is special in your application.

Can you also reproduce the problem with the --tool=none tool ? Or does it
happen only with memcheck ?
Can you check if the problem goes away when using
--vex-iropt-register-updates=allregs-at-each-insn ?
If the problem cannot be reproduced with this setting, can you see if it
reproduces with --vex-iropt-register-updates=allregs-at-mem-access ?

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-23 Thread Paul Floyd
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #15 from Paul Floyd  ---
OK so it looks like a problem with
VG_(fixup_guest_state_after_syscall_interrupted) as described by Philippe

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-19 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #14 from Libor Peltan  ---
Created attachment 152228
  --> https://bugs.kde.org/attachment.cgi?id=152228=edit
Strace failing syscalls from a reproducing scenario.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-19 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #13 from Libor Peltan  ---
(In reply to Paul Floyd from comment #12)
> And do you see any ERESTARTNOINTR with strace?

No. I see only those return error from any failed syscalls:

EAGAIN
ECONNREFUSED
EEXIST
EINPROGRESS
EINTR
EINVAL
ENOENT
ERESTARTSYS

I'm going to attach the strace. Please note, that this strace is not from the
same run as the logs I attached earlier today, but from similar scenario, and
the bug appered same way (just possibly other number of times).

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-19 Thread Paul Floyd
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #12 from Paul Floyd  ---
And do you see any ERESTARTNOINTR with strace?

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-19 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #11 from Libor Peltan  ---
Created attachment 152221
  --> https://bugs.kde.org/attachment.cgi?id=152221=edit
The log of valgrind returning wrong syscall value five times.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-19 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #10 from Libor Peltan  ---
@Philippe Thanks much for your deep analysis! I wouldn't be able to see such
things in conext. However, I think you are not entirely correct. I think the
observed bug is not caused by other thread aborting.  The crash on assert seems
to be the consequence, not the cause of the mess.

I prepared another scenario, where the program does not crash at all. It just
ignores the wrong 0xe8 value returned from epoll and continues as if it's zero.
In this case, it happened five times during the run and the program did not
crash, and operated smoohtly up to the end of test case.

By the way, writing to fd 2 (stderr) is done only by Valgrind, due to the
setting `-d -d -d`.

I'm going to attach those logs, this time also the stderr. I hope this helps to
find the definitive cause :)

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-18 Thread Paul Floyd
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #9 from Paul Floyd  ---
(In reply to Philippe Waroquiers from comment #8)

> So, an hypothesis about what happens:
>   * the application encounters an error condition (in tid 18 in the epoll
> case, in tid 11 in the futex case)
>   * this application thread decides to call abort, generating a fatal signal
>   * valgrind handling of a fatal signal is for sure complex and might still
> be racy, and might not properly reset the guest state
>when the fatal signal has to be handled by a thread doing e.g. epoll or
> futex syscall
> As the guest state is not properly restored, when this thread "resumes"
> and/or due to race conditions, instead of just dying, it continues
> and then itself reports a strange state as the guest thread state was not
> properly restored using a call to 
> VG_(fixup_guest_state_after_syscall_interrupted).

I was thinking on similar lines of https://bugs.kde.org/show_bug.cgi?id=445743
where the syscall is supposed to never return when interrupted. Under Valgrind
it does get interrupted and Valgrind doesn't resume it.

It would be interesting to see if strace shows any ERESTARTNOINTR return status
values.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-17 Thread Philippe Waroquiers
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #8 from Philippe Waroquiers  ---
I took a look at both logs.
First the epoll log.
(tid is the an thread id number used internally in valgrind)

What we see is that the tid 14 is just getting the result of a previous epoll
syscall, and then starts a new epoll syscall:

--111235--   SCHED[14]:  acquired lock (VG_(client_syscall)[async])<
tid 14 acquires the valgrind lock as the epoll syscall ended
SYSCALL[111235,14](232) ... [async] --> Success(0x0)   
 < and the tid 14 reports the success of this syscall
--111235--   SCHED[14]: TRC: SYSCALL   
 < it launches a new epoll syscall
SYSCALL[111235,14](232) sys_epoll_wait ( 18, 0x2610f740, 1, 1000 ) --> [async]
... 
--111235--   SCHED[14]: releasing lock (VG_(client_syscall)[async]) ->
VgTs_WaitSys < and releases the lock waiting for the result
--111235--   SCHED[35]:  acquired lock (VG_(scheduler):timeslice)  
   < and the tid 35 acquires the lock
.

Then later on, the tid 18 calls tgkill sending a (fatal) signal to itself (I
believe it is to itself, the tracing of valgrind of the link 
between the tid and the linux thread id is not very clear).
As this signal is fatal, all threads are being killed by valgrind
We see that a little bit before the tgkill that tid18 does a write on fd 2.
Possibly that is an indication of reporting an error/problem.

The problem with the futex has a similar pattern:
The tid 6 starts a futex syscall and releases the valgrind lock.
Then sometime later, the tid 11 is doing an mmap, and then slightly after
calls tgkill.
And similarly this tid 11 does a write on fd 2 a little bit before.

The processing of a fatal signal in valgrind is quite tricky : complex code,
with race conditions see e.g. bug 409367.

This fatal signal has to get all the threads out of their syscalls.
For this, a kind of internal signal "vgkill"  is sent by the valgrind scheduler
to all threads.
When the signal is received, valgrind detects that the thread was in a syscall
and that the thread
has to "interrupt" the syscall. For this, valgrind calls VG_(post_syscall). But
this post_syscall assumes that the guest state
is correctly "fixed", but I do not see where this is done.

So, an hypothesis about what happens:
  * the application encounters an error condition (in tid 18 in the epoll case,
in tid 11 in the futex case)
  * this application thread decides to call abort, generating a fatal signal
  * valgrind handling of a fatal signal is for sure complex and might still be
racy, and might not properly reset the guest state
   when the fatal signal has to be handled by a thread doing e.g. epoll or
futex syscall
As the guest state is not properly restored, when this thread "resumes" and/or
due to race conditions, instead of just dying, it continues
and then itself reports a strange state as the guest thread state was not
properly restored using a call to 
VG_(fixup_guest_state_after_syscall_interrupted).

To validate this hypothesis, maybe the following could be done:
* check what is this "write on fd 2" doing (maybe with strace?)
* in case the application encounters a problem, instead of calling abort that
sends a fatal signal, you might rather do e.g. sleep(10).
   If the hypothesis is correct, then the thread doing epoll or futex should
just stay blocked in their syscall,
   and the thread detecting the problem will sleep in this state.
   It might then be possible to attach using gdb+vgdb and investigate the state
of the application and/or the valgrind threads.

There is a way to tell valgrind to launch gdbserver in case of an abnormal
valgrind exit using the option:
--vgdb-stop-at=event1,event2,... invoke gdbserver for given events [none]
 where event is one of:
   startup exit valgrindabexit all none

Looks like we might add abexit to ask valgrind to call gdbserver when a client
thread/process does an abnormal exit.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-16 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #7 from Libor Peltan  ---
Created attachment 152095
  --> https://bugs.kde.org/attachment.cgi?id=152095=edit
The log of valgrind crashing knotd after mishandled epoll_wait syscall.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-16 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #6 from Libor Peltan  ---
I frankly don't understand the flags of futex syscall. This is what glibc does.

To possibly bring more light to the issue, I'm going to upload a log from
completely different reproducer.

This time, the affected syscall is epoll_wait. Again, we are facing the bug
causing it to return the number 232 instead proper return value (usually zero,
one or some negative for error).

This actually reproduces quite easily and often, I just had to remove the
workaround from our software. I also let the affected thread wait for one
second before crashing on assert, so that the logging settles down.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-16 Thread Paul Floyd
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #5 from Paul Floyd  ---
Thje pair of entry/exit traces are
SYSCALL[260041,11](202) sys_futex ( 0x1ffefff9f4, 393, 0, 0x2633d890, 0x0 ) -->
[async] ...
SYSCALL[260041,11](202) ... [async] --> Success(0xca)

the futex of is 393 or 0x189
That's PRIVATE, REALTIME_CLOCK TRYLOCK_PI and WAKE

I'm a bit confused as to what TRYLOCK_PI and WAKE do together.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-15 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #4 from Libor Peltan  ---
Created attachment 152081
  --> https://bugs.kde.org/attachment.cgi?id=152081=edit
The log of valgrind crashing glibc by mishandled futex syscall.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-15 Thread Libor Peltan
https://bugs.kde.org/show_bug.cgi?id=458915

--- Comment #3 from Libor Peltan  ---
@Paul
Helgrind reports oh so many errors, I can hardly find any clue in them. I guess
this tool is not intended for debugging (mostly) lockless multi-threaded
programs (where concurrency is controlled by libRCU or other tricks). Anyway,
when I run `valgrind --tool=helgrind`, the described bug appears as well with
roughly equal frequency as for memcheck.

DRD also reports a ton of errors. It's so slow that it doesn't even pass the
test scenario in time limit, so I'm not making any conclusions from it.

Shall I walk through those tons of errors and look for something specific?

I tried applying the patch called "Patch for FreeBSD" from #445743, as well as
the hint from its 14th comment. This did not change anything, the bug persists.

@Philippe
If I hadn't overlooked something, strace does not report any signal received
shortly before the crash, besides the SIGABRT resulting from it. Anyway, it
seems to me, that the actual futex syscall is not in fact called as well, but
I'm not 100% sure.

I'm going to attach the valgrind output from the crashing run of the test, with
the parameters `-v -v -v -d -d -d --trace-syscalls=yes --trace-signals=yes
--trace-sched=yes` as you suggested. It's pretty long, please grep for the word
"abort".

My speculations around RAX are based on the idea that if, due to some bug, the
syscall is not in fact called by valgrind, but expected to have been called,
the syscall number in RAX could be easily misinterpreted to be the return value
in RAX. But this is of course too simple a view.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-14 Thread Philippe Waroquiers
https://bugs.kde.org/show_bug.cgi?id=458915

Philippe Waroquiers  changed:

   What|Removed |Added

 CC||philippe.waroquiers@skynet.
   ||be

--- Comment #2 from Philippe Waroquiers  ---
Does strace show that there are some signals being processed close (in time) to
the system call wrongly returning its syscall number ?
If a signal happens when syscall-amd64-linux.S is being executed, 
then  VG_(fixup_guest_state_after_syscall_interrupted) has some complex logic
that interacts with the partially executed asm code.

Alternatively, having more valgrind tracing might give some hints.
You could try
valgrind -v -v -v -d -d -d --trace-syscalls=yes --trace-signals=yes
your_app

and if your application is multi-threaded (I guess it is), you might also use
--trace-sched=yes

With regards to "What intrigues me that both the syscall number and the return
value appear in the RAX register at some point."
If you speak about the "physical RAX register", then I think this is normal. To
execute a syscall, the syscall number must be set
to the syscall number before the syscall instruction, and on return of the
syscall instruction, the RAX register contains the syscall return value.

When this syscall instruction is finished, the syscall return code (stored by
the kernel in the physical register RAX) 
must be moved to the guest register RAX.

-- 
You are receiving this mail because:
You are watching all bug changes.

[valgrind] [Bug 458915] syscall sometimes returns its number instead of return code

2022-09-13 Thread Paul Floyd
https://bugs.kde.org/show_bug.cgi?id=458915

Paul Floyd  changed:

   What|Removed |Added

 CC||pjfl...@wanadoo.fr

--- Comment #1 from Paul Floyd  ---
Do DRD or Helgrind ever report any errors?

Does the patch in https://bugs.kde.org/show_bug.cgi?id=445743 (plus the change
in comment 14) change anything?

-- 
You are receiving this mail because:
You are watching all bug changes.