[pygr] new subprocess interface pygr.classutil.FilePopen

Christopher Lee Thu, 02 Apr 2009 11:57:54 -0700

Hi,
as a result of the decision to defer thread-safety in 0.8, I switched  
our external subprocess handling to use temporary files.  I created a  
new class pygr.classutil.FilePopen that follows Python's  
subprocess.Popen interface while addressing several problems:


- use of Popen.wait() is prone to deadlocks (see the subprocess docs).

- the recommended read-write interface, Popen.communicate(),  
apparently cannot handle even modest amounts of data (>64k), again  
resulting in deadlocks (there's a veiled warning in the subprocess  
docs; see the following blog page for some real testing).
http://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/

- more broadly, Popen.communicate() forces you to keep the entire data  
to be read and / or written as a single string in memory.  This is not  
suitable for very large data transfers.  Far more flexible is the  
standard model of reading / writing streams to transfer as much data  
as you want without ever having to keep all of it in memory.  Streams  
are an easy, ubiquitous interface that would be compatible with a much  
broader range of existing code (any function that reads / write to a  
file-like object) than Popen.communicate().

- Popen provides stream attributes (Popen.stdin, Popen.stdout,  
Popen.stderr), but the docs say that using them directly will result  
in deadlocks.  (That is probably why they try to make you use  
Popen.communicate()).  Also these attributes are only set if that  
stream is a pipe, not if it is a regular file.  The bottom line is  
that you can't use these stream attributes reliably, so why does the  
subprocess module present them as part of the public interface?

- The subprocess module is not available on Python 2.3, which is a  
problem for Pygr.  The older interfaces (os.popen, popen2, os.system  
etc.) have their own problems (the same deadlock issues; vulnerability  
to shell injection attacks, etc.).  Even worse, we were already  
propagating a pattern throughout Pygr of writing duplicate code for  
every external process interaction: once using subprocess, and again  
using older interfaces for Python 2.3.  That way lies madness.

- all of these interfaces seem to force people to use multiple  
threading as the only way of evading deadlocks.  That seems like an  
overly complicated solution, if all you want to do is pass a bunch of  
data to an external program, then block until it returns all its  
results (i.e. if you are "calling" the external program like a  
subroutine).  Most code in the world (including Pygr 0.8) is not truly  
thread-safe, so invoking multiple threading is either asking for  
trouble, or drastically limits what code resources you can use.

I have implemented the obvious solution:
- use temporary files in place of pipes for a subprocess' stdin,  
stdout, and stderr.

- Instead of using Popen.communicate() to read / write, you simply  
write to Popen.stdin, then call Popen.wait() to let the subprocess  
finish, and finally read from Popen.stdout (and if you wish,  
Popen.stderr).

- when the Popen object is garbage collected, the temporary files are  
automatically deleted from the file system.

- because you are simply reading / writing to a file (not a pipe to a  
subprocess), deadlocks are not possible, and there is no need for  
multi-threading.

- this implementation uses the standard Python subprocess module if  
available, but in its absence (on Python 2.3) transparently provides  
the same functionality using older interfaces.  (Since all  
interprocess communication is via files, a simple redirect of stdin /  
stdout is all that's required.  Currently this doesn't support stderr  
redirection, but we could add that.)  In this case each argument is  
"shell-escaped" against shell-injection attacks using  
commands.mkarg().  The older interface doesn't currently handle  
fancier aspects of subprocess's arguments (like change the child's  
environment); currently, those arguments are silently ignored.

The new interface is called pygr.classutil.FilePopen, which mimics the  
subprocess.Popen interface.  There is also a convenience function  
pygr.classutil.call_subprocess(), which mimics the subprocess.call()  
function.  Some details:

- if you specify the keyword argument stdin, stdout or stderr to be  
pygr.classutil.PIPE (same as subprocess.PIPE), it connects that stream  
to a temporary file (created and deleted automatically) instead of a  
pipe.

- if you specify the argument stdinFlag, it appends that plus the  
stdin filename to the command line arguments for invoking the  
subprocess.  This is useful for programs like megablast that  
apparently will only read from a filename argument given on the  
command line, and not from stdin.  The stdinFlag switches the behavior  
of the stdin stream from being passed as stdin-redirection to being  
simply appended as a command line argument.  For example, stdinFlag='- 
i' would cause the following to be appended to the command line  
arguments: -i /path/to/temp/file.  If you pass stdinFlag=None, *only*  
the stdin filename is appended to the command line arguments (with no  
preceding flag argument).

- same thing for stdoutFlag

- use the attributes FilePopen.stdin, FilePopen.stdout,  
FilePopen.stderr for reading / writing from / to the subprocess.   
These attributes are present and usable regardless of whether they  
were passed as PIPE or as a regular file.  Any attribute not passed  
PIPE or a file object will be None.

- basic usage pattern: 1. create the FilePopen object; 2. write all  
its input to FilePopen.stdin; 3. call FilePopen.wait() to wait until  
the subprocess finishes; 4. if desired read from FilePopen.stdout /  
stderr to get its output; 5. releasing the FilePopen object will cause  
any temporary files (i.e. created with the PIPE argument) to be  
deleted from the file system.  You don't need to flush() or rewind any  
of the streams; FilePopen.wait() does that for you.

- See usage examples in pygr/downloader.py and pygr/blast.py

- like Popen.wait(), FilePopen.wait() returns the return code of the  
subprocess.  Note that the subprocess is not actually started until  
you can FilePopen.wait().

- most other aspects of the subprocess.Popen interface (such as  
communicate()) are not implemented on FilePopen, because they aren't  
relevant.

All existing usages in Pygr of subprocess / popen and all but a few of  
the os.system calls have been switched to FilePopen / call_subprocess.

All of this was pushed to master around 10 pm last night.  Get it from  
the public repo.or.cz repository.  I've tested so far on Python 2.5.2  
and Python 2.6.1 (Mac OS 10.5.6) and Python 2.3.3 (linux 2.4.20), and  
it seems to have passed last night's megatests fine as well.

Yours,

Chris

On Mar 31, 2009, at 9:42 PM, C. Titus Brown wrote:

>
> On Tue, Mar 31, 2009 at 09:42:00PM -0700, Alexander Alekseyenko wrote:
> -> In my opinion we should make a note that pygr 0.8 is not thread  
> safe,
> -> use the temporary file solution for now and put thread safety on  
> a to
> -> do list for 1.0.
>
> +1
>
> cheers,
> --titus

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"pygr-dev" group.
To post to this group, send email to pygr-dev@googlegroups.com
To unsubscribe from this group, send email to 
pygr-dev+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/pygr-dev?hl=en
-~----------~----~----~----~------~----~------~--~---

[pygr] new subprocess interface pygr.classutil.FilePopen

Reply via email to