Hi,

Summary:

* In my organization, we have a *multi-threaded* (threading library) Python
  (Python 2.4.1) daemon on Linux which starts up various processes using a
  fork / pipe / exec model.
* We use this fork, wait-on-pipe, exec sequence as a form of handshake
  between the parent and child processes: we want the child to go ahead only
  after the parent has noted down the fact that the child has been forked
  and what its pid is.
* This usually works fine, but for about 1 in every 20,000 processes
  started, the child process just freezes somewhere after the fork, before
  the exec. It does not die. It is alive and stuck.
* Why does this happen?
* Is there a better way for us to write a fork-wait_for_start_signal-exec
  construct?
Here is what we do. One of the threads of the multi-threaded Python daemon
does the following:

1) Fork out a child process.
2) The child process waits for a pipe message from the parent.
   --- The parent sends the pipe message to the child after noting down the
   child's details: pid, start time, etc. ---
3) The child process prints various debug messages, including looking at
   os.environ values.
4) The child process execs the right script.

Here it is again, in pseudo code:

    import os
    import select
    import sys

    def start_job(path, args, env):
        read_pipefd, write_pipefd = os.pipe()
        # 1) Fork out a child process
        pid = os.fork()
        if pid == 0:
            # 2) Wait for the expected message on the pipe (300s timeout)
            os.close(write_pipefd)
            read_set, write_set, exp_set = \
                select.select([read_pipefd], [], [], 300)
            if os.read(read_pipefd, len("expected message")) != "expected message":
                os._exit(1)
            os.close(read_pipefd)
            # 3) Print various debug messages, including os.environ values
            print >> sys.stderr, "we print various debug messages here, " \
                "including os.environ values"
            # 4) Go ahead with exec
            os.execve(path, args, env)
        else:
            # Parent process sends the pipe message to the child at the
            # right time (see the filled-in sketch in the P.S. below)
            pass

The problem:

* Things work fine most of the time, but rarely the process gets "stuck"
  after the fork and before the exec (in steps 2 or 3 above). The process
  makes no progress and does not die either.
* When I attach gdb (gdb 6.5) to the stuck process, bt fails as follows:

    (gdb) bt
    #0  0x00002ba9fd5c6a68 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
    #1  0x00002ba9fd5c2a78 in _L_mutex_lock_106 () from /lib64/libpthread.so.0
    dwarf2-frame.c:521: internal-error: Unknown CFI encountered.
    A problem internal to GDB has been detected,
    further debugging may prove unreliable.
    Quit this debugging session? (y or n)

  I looked into this error and found that pre-6.6 gdb throws it when looking
  at the stack trace of a deadlocked process. This is certainly not a
  deadlock in my own code, as there is no locking involved in this area of
  the code.
* This problem happens for about 1 process in every 20,000. This statistic
  was gathered across about 80 machines in our cluster, so it is not a case
  of a single machine having a hardware issue.
* Note that the child is forked out by a *multi-threaded* Python
  application. I noticed some forums discussing how multi-threaded (pthreads
  library) processes doing things between a fork and an exec can rarely get
  into a deadlock. I had understood that Python (at least 2.4.1)
  multi-threading does not use pthreads directly, but presumably the Python
  interpreter itself does use pthreads?

Questions:

* Why does this happen?
* Is there a better way for us to write a fork-wait_for_start_signal-exec
  construct in a multi-threaded application?

Thanks,
Gangadharan
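
P.S. For completeness, here is a self-contained sketch of the construct with
the parent side filled in. note_down_child() is only a stand-in for our real
bookkeeping code, and the error handling is deliberately minimal:

    import os
    import select
    import sys

    START_MSG = "expected message"

    def note_down_child(pid):
        # Stand-in: record the child's pid, start time, etc.
        pass

    def start_job(path, args, env):
        read_pipefd, write_pipefd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: wait up to 300s for the go-ahead from the parent.
            os.close(write_pipefd)
            ready, _, _ = select.select([read_pipefd], [], [], 300)
            if not ready:
                os._exit(1)           # timed out waiting for the parent
            if os.read(read_pipefd, len(START_MSG)) != START_MSG:
                os._exit(1)
            os.close(read_pipefd)
            print >> sys.stderr, "debug messages, os.environ values, ..."
            try:
                os.execve(path, args, env)
            except OSError:
                os._exit(127)         # the exec itself failed
        else:
            # Parent: note down the child's details, then release it.
            os.close(read_pipefd)
            note_down_child(pid)
            os.write(write_pipefd, START_MSG)
            os.close(write_pipefd)
            return pid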
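
P.P.S. To give some context to the second question: the suggestion I keep
seeing for multi-threaded parents is to keep the window between fork and
exec as small as possible (only low-level os.* calls, no Python-level
printing, no os.environ walking) and to move all of the logging into the
parent. Here is a minimal sketch of that variant, assuming the same pipe
handshake. A single short os.read stands in for a proper read loop, and the
child still runs Python bytecode, so this narrows the window rather than
closing it:

    import os

    GO_MSG = "go"

    def start_job_minimal(path, args, env):
        read_pipefd, write_pipefd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: only close/read/execve/_exit between fork and exec,
            # so we never touch stdio or other locks that a sibling thread
            # may have been holding at fork time.
            os.close(write_pipefd)
            msg = os.read(read_pipefd, len(GO_MSG))  # blocks; "" on parent death
            if msg != GO_MSG:
                os._exit(1)
            os.close(read_pipefd)
            try:
                os.execve(path, args, env)
            except OSError:
                os._exit(127)
        else:
            # Parent: all bookkeeping and debug logging happens here,
            # before the child is released.
            os.close(read_pipefd)
            # ... note down pid, start time, emit the debug messages ...
            os.write(write_pipefd, GO_MSG)
            os.close(write_pipefd)
            return pid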