anirudh2290 commented on issue #13438: libc getenv is not threadsafe
URL: https://github.com/apache/incubator-mxnet/issues/13438#issuecomment-444741884

The problem, as described in the article linked in this issue and in the related article https://rachelbythebay.com/w/2017/01/30/env/, is the following: if the main thread spawns another thread that calls setenv, and the process is forked while that setenv call is in progress, the child process inherits the mutex in its locked state. The thread that held the lock does not exist in the child, so the lock is never released and the child hangs as soon as it tries to acquire it.

This can be replicated in MXNet in the following way. Pull the code from https://github.com/anirudh2290/mxnet/tree/setenv_issue and build it with a command similar to the following:

```
cd build && cmake VERBOSE=1 -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_MKLDNN=ON -DUSE_OPENMP=ON -DUSE_OPENCV=OFF -DCMAKE_BUILD_TYPE=Debug -GNinja ..
```

Run the following script:

```
import multiprocessing
import os
import sys
import mxnet as mx

def mxnet_worker():
    print('inside mxnet_worker')

mx.base._LIB.MXStartBackgroundThread(mx.base.c_str("dummy"))
read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
for p in read_process:
    p.daemon = True
    p.start()
    p.join()
```

When you run the script, the process hangs. When I attach gdb to the process I see the following:

```
#0 __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007fc0fabab99c in __add_to_environ (name=0x7fc093a935fc "MXNET_CPU_WORKER_NTHREADS", value=0x7fffec2eff10 "1", combined=0x0, replace=1) at setenv.c:133
```

which means it is stuck trying to acquire the lock: https://github.com/lattera/glibc/blob/master/stdlib/setenv.c#L133

I checked the mxnet codebase to see if we are calling SetEnv anywhere else, and we don't seem to be calling it anywhere except here. Also, the `pthread_atfork` handler calls `Engine::Get()->Stop()`, which would mean that all engine threads are suspended.
It is still possible that it is called from other multithreaded code in MXNet (iterators, for example), but I couldn't find such a call, and it is unlikely that MXNet or dmlc-core code uses something other than dmlc::SetEnv to set env vars. I think it is more likely that the customer application spawned a thread that called `SetEnv` at the same time `pthread_atfork` was called, which led to this behavior.
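
For illustration, the same failure mode can be sketched without MXNet or glibc. The snippet below is only a minimal sketch under stated assumptions: `env_lock` is a hypothetical stand-in for the lock that setenv takes inside glibc, and `hold_lock` and `child_can_take_lock` are made-up names, not part of any library. A background thread holds the lock while the main thread forks, and the child then tries to take the same lock, using a timeout so the example terminates instead of hanging.

```python
import os
import threading

# Hypothetical stand-in for the lock that setenv takes inside glibc.
env_lock = threading.Lock()


def hold_lock(acquired, release):
    """Background thread: take the lock and keep holding it across the fork."""
    with env_lock:
        acquired.set()
        release.wait()


def child_can_take_lock(timeout=1.0):
    """Fork while another thread holds env_lock; report whether the child got it."""
    acquired = threading.Event()
    release = threading.Event()
    holder = threading.Thread(target=hold_lock, args=(acquired, release))
    holder.start()
    acquired.wait()  # the background thread now holds the lock

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread was copied, but the lock was copied
        # in its locked state. A timeout is used instead of blocking forever.
        got_it = env_lock.acquire(timeout=timeout)
        os._exit(0 if got_it else 1)

    # Parent: collect the child's result, then let the holder thread finish.
    _, status = os.waitpid(pid, 0)
    release.set()
    holder.join()
    return os.WEXITSTATUS(status) == 0
```

On Linux with CPython 3, `child_can_take_lock()` should return `False`: the thread that owns the lock was not copied into the child, so nothing there can ever release it. Without the timeout, the child would block in the same way as the `__add_to_environ` frame in the gdb output.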
