Carsten,

thanks for the detailed analysis.

There is definitely a cost to getpid() in some circumstances. The worst case I can imagine is on code like the one below that looks over OGRCoordinateTransformation::Transform() many times. For each call to that function, there's a OSRGetProjTLSContext() call which involves getpid() when pthread_atfork is not used. The overhead is ~ 15% with that reproducer, so definitely measurable.

In https://github.com/OSGeo/gdal/pull/8909, I've taken into account your suggestion to call pthread_atfork() only once.  This is true that the detection might not work well with threads, but :

- generally mixing fork() and threads is a very bad idea as this leads to many issues like mutexes that would be taken by a thread that is not the one forked being let in a state where the child process can't acquire them anymore. So the safer practice is that you use either threading , or fork() your main process quite at its beginning to have child processes

- but even if you have let's say 2 threads A and B, and A calls fork(). Then the child process will only have a main thread at its creation, so it will inherit only from the one TLS instance of OSRPJContextHolder duplicated from A. Hence having a global g_bForkOccurred shouldn't be an issue. At least I believe... Anyway as mentioned in my previous point, mixing both paradigm is generally a bad idea.

Even

#include "ogr_spatialref.h"

int main(int argc, char **argv) {
    OGRSpatialReference srs1;
    srs1.importFromEPSG(4326);
    OGRSpatialReference srs2;
    srs2.importFromEPSG(32631);
    auto poCT = OGRCreateCoordinateTransformation(&srs1, &srs2);
    for(int i = 0; i < 5 * 1000 * 1000; ++i)
    {
        double dfX = 49;
        double dfY = 2;
        poCT->TransformWithErrorCodes(1, &dfX, &dfY, nullptr, nullptr, nullptr);
    }
    delete poCT;
    return 0;
}

Le 04/12/2023 à 12:21, Carsten Schultz via gdal-dev a écrit :
Hello everyone!

Please let me know if you would have preferred to discuss this on Github.

I have run into a problem with the use of pthread_atfork() in ogr_proj_p.cpp. I 
have since seen that the code has been disabled for macOS in 3.8 in response to 
https://github.com/OSGeo/gdal/issues/8497, and indeed I encountered the problem 
in 3.6.4 on macOS. The problems are however more general in my opinion.


I have two concerns.

1) The call `pthread_atfork(nullptr, nullptr, ForkOccurred)` install a handler 
which sets a flag in a global variable in a forked child. This handler should 
be installed exactly once. (Indeed there is no provision to remove such a 
handler.) However, the call is done in the constructor of OSRPJContextHolder 
and since there is a thread local instance of this, that constructor can be 
called an unlimited number of times.

2) The static flag g_bForkOccurred that is set by ForkOccured is used to signal 
that a fork has occurred and OSRPJContextHolder will clear that flag when it 
acts upon it. This means that at most one OSRPJContextHolder instance will 
observe the flag. That is probably not a problem as long as there is only one 
static or thread_local instance of OSRPJContextHolder, but it doesn’t look 
conceptually clean to me.


To see why (1) is a problem (except for the handler running several thousand 
time should a fork actually occur), there may be a limit to how many handlers 
can be actually registered, and pthread_atfork will return E_NOMEM if this 
limit is exceeded. Now since the code doesn’t check the return code and 
shouldn’t have registered the handler multiple times anyway, this will usually 
not be noticed. However, an unrelated part of the program may try to use 
pthread_atfork after GDAL has taken up all slots and react less nonchalantly to 
a failure. This is what happened to me. This is most likely the reason if a 
problem is noticed on macOS and not Linux, because on macOS the limit can be as 
low as 170 calls to pthread_atfork (4096 / 24, vm page size / size of 3 
pointers) while I haven’t encountered such a limit on Linux.

Additionally pthread_atfork() is most likely not thread safe, and it is called 
without synchronisation.


Now all of this is of course easy to fix, but at first I wanted to ask whether 
you consider it actually worth the effort. If the getpid() version works fine, 
is that one system call expensive enough to warrant the effort of maintaining 
the pthread_atfork code?

Thanks for your time,

Carsten
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev

--
http://www.spatialys.com
My software is free, but my time generally not.

_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev

Reply via email to