Hi Christian,

On Fri, 2018-02-02 09:27:59 +0100, Christian Grothoff <groth...@gnunet.org> wrote:
> This is very strange, the loop should not fix this, as pthread_join
> should simply block (not race!) until the thread is done.  In fact, I
> generally think the right answer to ESRCH would be to die, as to me this
> indicates some kind of memory corruption or other severe invariant
> violation.

That assumption doesn't match what I observed, though: simply calling
pthread_join() again after a 10 ms delay did the job. (I put the call
in a loop to give it a few more tries if needed.)
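For reference, the workaround looks roughly like the sketch below. The
names join_with_retry() and quick_worker() are made up for this mail and
are not part of libmicrohttpd; the 10 ms delay is the one mentioned
above, the retry limit is arbitrary. Note that this only retries on
ESRCH and is a workaround, not a fix.

```c
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper: retry pthread_join() while it reports ESRCH,
 * sleeping 10 ms between attempts. Returns the last pthread_join()
 * result (0 on success). */
static int join_with_retry(pthread_t thread, int max_tries)
{
    const struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
    int rc = ESRCH;

    for (int attempt = 0; attempt < max_tries; attempt++) {
        rc = pthread_join(thread, NULL);
        if (rc != ESRCH)
            break;              /* success, or an error retrying won't cure */
        nanosleep(&delay, NULL);
    }
    return rc;
}

/* Trivial thread body, only here so the helper can be exercised. */
static void *quick_worker(void *arg)
{
    (void)arg;
    return NULL;
}
```

In a healthy program the first attempt already succeeds and the loop
body runs exactly once.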

> Now, given that you mentioned changes related to popen()-logic in your
> own code, I wonder if the change in your _application_ logic related to
> fork() may be interoperating badly with threads.  In particular, after
> you fork(), all of the "other" threads will be gone, so if you fork()
> and then continue any MHD-interaction related to the threads spawned by
> MHD, that is likely to be, eh, problematic --- and may show up with an
> ESRCH.  However, that doesn't quite explain to me why putting this in a
> loop with sleeps might fix it. (But I don't know enough about your code.)
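To make the fork() point concrete for myself, here is a minimal sketch
(all names are mine, nothing here is MHD or popen-noshell code): a
background thread keeps incrementing a counter, the process forks, and
the child checks whether the counter still moves. On a POSIX system the
child contains only the thread that called fork(), so I'd expect the
counter to stand still there.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static atomic_int ticks;      /* incremented by the background thread */
static atomic_int stop_flag;  /* tells the background thread to exit  */

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

static void *ticker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_flag)) {
        atomic_fetch_add(&ticks, 1);
        sleep_ms(1);
    }
    return NULL;
}

/* Returns 0 if the counter stood still in the child (ticker thread is
 * absent there), 1 if it advanced, -1 on error. */
static int ticker_runs_in_child(void)
{
    pthread_t t;
    if (pthread_create(&t, NULL, ticker, NULL) != 0)
        return -1;
    while (atomic_load(&ticks) == 0)    /* make sure the ticker is running */
        sleep_ms(1);

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only the forking thread exists here. */
        int before = atomic_load(&ticks);
        sleep_ms(50);
        _exit(atomic_load(&ticks) != before ? 1 : 0);
    }

    int status = 0;
    int result = -1;
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    /* Parent: the ticker is still alive and can be joined normally. */
    atomic_store(&stop_flag, 1);
    pthread_join(t, NULL);
    return result;
}
```

This only shows that the other threads are gone in the child; it does
not by itself show which error a later pthread_join() on such a thread
would report.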

The code in use is https://github.com/famzah/popen-noshell ,
with a small wrapper that makes it look like plain popen()/pclose():
the wrapper's popen() puts the needed clean-up struct into a hash table
with the fp as its key, and its pclose() then uses the fp to recover the
clean-up struct pointer, which is passed to the simplified pclose()
variant.
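The bookkeeping part of that wrapper can be sketched as follows. This is
not my actual code: fp_registry_put() and fp_registry_take() are
illustrative names, the clean-up struct is represented by a plain void
pointer, and the calls into popen-noshell itself are left out. I'm also
assuming a mutex around the table, since the callers may be threads.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define FP_BUCKETS 64

/* One table entry: maps a FILE pointer to its clean-up data. */
struct fp_entry {
    FILE *fp;
    void *cleanup;
    struct fp_entry *next;
};

static struct fp_entry *fp_table[FP_BUCKETS];
static pthread_mutex_t fp_lock = PTHREAD_MUTEX_INITIALIZER;

static size_t fp_hash(const FILE *fp)
{
    /* Drop the low bits, which are mostly zero due to alignment. */
    return (size_t)(((uintptr_t)fp >> 4) % FP_BUCKETS);
}

/* Called from the popen() wrapper: remember the clean-up data for fp.
 * Returns 0 on success, -1 if out of memory. */
static int fp_registry_put(FILE *fp, void *cleanup)
{
    struct fp_entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->fp = fp;
    e->cleanup = cleanup;

    pthread_mutex_lock(&fp_lock);
    size_t b = fp_hash(fp);
    e->next = fp_table[b];
    fp_table[b] = e;
    pthread_mutex_unlock(&fp_lock);
    return 0;
}

/* Called from the pclose() wrapper: look up and remove the entry for fp.
 * Returns the stored clean-up pointer, or NULL if fp is unknown. */
static void *fp_registry_take(FILE *fp)
{
    void *cleanup = NULL;

    pthread_mutex_lock(&fp_lock);
    struct fp_entry **pp = &fp_table[fp_hash(fp)];
    while (*pp != NULL && (*pp)->fp != fp)
        pp = &(*pp)->next;
    if (*pp != NULL) {
        struct fp_entry *e = *pp;
        cleanup = e->cleanup;
        *pp = e->next;
        free(e);
    }
    pthread_mutex_unlock(&fp_lock);
    return cleanup;
}
```

The pclose() wrapper would then hand the pointer returned by
fp_registry_take() to the library's own close function.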

> Regardless, the loop/sleep is a very, very wrong fix, and I strongly
> suspect the problem is in your code (or how you use the MHD API, in
> conjunction with fork()).

You're completely right here. I wrote some more small test programs,
and I observe two things:

  * pthread_join() indeed waits as promised in my tests; and
  * I cannot reproduce that non-waiting / failing behavior with any of
    my test attempts so far.
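The test programs are all variations of the following sketch (names are
illustrative, and the 100 ms sleep is an arbitrary choice): a worker
sleeps for a while and then sets a flag, and the main thread checks the
flag right after pthread_join() returns.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int worker_done;  /* set by the worker just before it exits */

static void *slow_worker(void *arg)
{
    (void)arg;
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };
    nanosleep(&ts, NULL);           /* pretend to do 100 ms of work */
    atomic_store(&worker_done, 1);
    return NULL;
}

/* Returns 1 if pthread_join() succeeded and the worker had finished by
 * the time it returned, 0 if not, -1 if the thread could not be created. */
static int join_waited_for_worker(void)
{
    pthread_t t;

    atomic_store(&worker_done, 0);
    if (pthread_create(&t, NULL, slow_worker, NULL) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return 0;
    return atomic_load(&worker_done);
}
```

In all my runs this reports that the join succeeded and the worker was
done, which is the behaviour you describe.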

I did, however, find _one_ similar bug report, where pthread_join()
failed in a similar way:

Unfortunately, the provided testcase is incorrect (see my comment
there) and the bug report was never finished, so I don't know
whether the same mistake exists in their original application.
However, since libmicrohttpd just ignores the thread's result (passing
a NULL pointer to pthread_join()), and given my observation that the
join works on a second try, I'm quite confident that something
different is happening here.

Looping until success (even when limited to a short, justifiable
timespan) isn't a correct fix, of course. And indeed, pthread_join()
should wait, so I'm off again trying to find out in which situations
it might not.


Getslash GmbH, Hermann-Johenning-Platz 2, 59302 Oelde
Tel: +49-2522-834349-5    Fax:   +49-2522-834349-1
http://www.getslash.de    Mobil: +49-152-33822499
Sitz der Gesellschaft: Oelde
Handelsregister: Amtsgericht Münster, HRB 11911
Ust-Id-Nr.: DE 815060326
Geschäftsführung: Andre Peitz, Tobias Hanisch
