anirudh2290 commented on issue #13438: libc getenv is not threadsafe
URL: 
https://github.com/apache/incubator-mxnet/issues/13438#issuecomment-444741884
 
 
   The problem, as described in the article linked in this issue and in the 
related article here: https://rachelbythebay.com/w/2017/01/30/env/, is the 
following. If the main thread spawns another thread that calls setenv, and the 
process is forked while that setenv call is in progress, the child process 
inherits the environment mutex in the locked state. The mutex will never be 
unlocked in the child, because the thread that held it does not exist there, so 
the next setenv call in the child hangs. 
   This can be replicated in MXNet in the following way. Pull the code from 
https://github.com/anirudh2290/mxnet/tree/setenv_issue and build it similar to 
the following:
   ```
   cd build && cmake VERBOSE=1 -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_MKLDNN=ON -DUSE_OPENMP=ON -DUSE_OPENCV=OFF -DCMAKE_BUILD_TYPE=Debug -GNinja ..
   ```
   
   Run the following script:
   
   ```
   import multiprocessing
   import os
   import sys
   import mxnet as mx

   def mxnet_worker():
       print('inside mxnet_worker')

   # MXStartBackgroundThread comes from the setenv_issue branch linked above.
   mx.base._LIB.MXStartBackgroundThread(mx.base.c_str("dummy"))
   # Start and join the worker processes one at a time.
   read_process = [multiprocessing.Process(target=mxnet_worker) for i in range(8)]
   for p in read_process:
       p.daemon = True
       p.start()
       p.join()
   ```
   
   When the script runs, the process hangs.
   Attaching gdb to the hung process shows the following:
   
   ```
   #0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
   #1  0x00007fc0fabab99c in __add_to_environ (name=0x7fc093a935fc "MXNET_CPU_WORKER_NTHREADS", value=0x7fffec2eff10 "1", combined=0x0, replace=1) at setenv.c:133
   ```
   
   which means it is stuck trying to acquire the lock: 
https://github.com/lattera/glibc/blob/master/stdlib/setenv.c#L133
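
   The same failure mode can be illustrated without MXNet or glibc. The sketch 
below is only an analogy: a `threading.Lock` stands in for the lock taken inside 
setenv, and the function name `child_sees_locked_mutex` and its `timeout` 
parameter are hypothetical names made up for this example. It assumes Python 3 
on a platform that provides `os.fork`. A helper thread holds the lock while the 
process forks, and the child then tries to take the same lock with a timeout 
instead of blocking forever.

```python
import os
import threading


def child_sees_locked_mutex(timeout=0.2):
    """Fork while a helper thread holds a lock; return True if the child cannot take it."""
    lock = threading.Lock()        # stand-in for the lock taken inside setenv
    held = threading.Event()       # set once the helper thread owns the lock
    release = threading.Event()    # tells the helper thread to let go

    def holder():
        with lock:
            held.set()
            release.wait()

    t = threading.Thread(target=holder)
    t.start()
    held.wait()                    # the lock is definitely held at fork time

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so nobody can release the lock.
        os.close(read_fd)
        acquired = lock.acquire(timeout=timeout)
        os.write(write_fd, b"1" if acquired else b"0")
        os._exit(0)

    # Parent: collect the child's answer, then release the helper thread.
    os.close(write_fd)
    answer = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    release.set()
    t.join()
    return answer == b"0"


if __name__ == "__main__":
    print(child_sees_locked_mutex())
```

   The timeout is there only so the sketch terminates. The lock inside setenv 
has no timeout, which is why the real child process waits indefinitely in 
`__lll_lock_wait_private`.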
   
   I checked the MXNet codebase to see whether we call SetEnv anywhere else, 
and we don't seem to call it anywhere except here. Also, the handler registered 
with pthread_atfork calls `Engine::Get()->Stop()`, which should mean that all 
engine threads are suspended at fork time. It is still possible that SetEnv is 
called from other multithreaded code in MXNet, iterators for example, but I 
couldn't find such a call, and it is unlikely that MXNet or dmlc-core code sets 
environment variables through anything other than dmlc::SetEnv. I think it is 
more likely that the customer application spawned a thread that called `SetEnv` 
at the same time the pthread_atfork handler ran, which led to this behavior. 

