Tim Peters <t...@python.org> added the comment:

Whenever there's parallel processing with communication, there's always the 
potential for producers to pump out data faster than consumers can process 
it.  But builtin primitives generally don't try to address that directly.  
They don't - and can't - know enough about the application's intent.

Instead, as I briefly alluded to earlier, when mediation is needed, users 
frequently accomplish it explicitly with bounded queues.  When process A 
produces data for process B, it sends the data over a bounded queue.  Nothing 
is done to slow A down, except that when a bounded queue is full, an attempt to 
enqueue a new piece of data blocks until B removes some old data from the 
queue.  That's a dead easy, painless, and foolproof way to limit A's speed to 
the rate at which B can consume data.  Nothing in A's logic changes - it's the 
communication channel that applies the brakes.
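Here's a minimal sketch of that idea (my own toy example, not from the 
attached file - and it uses a thread plus queue.Queue for brevity, though a 
multiprocessing.Queue blocks the same way):

```python
import queue
import threading
import time

def producer(q, n):
    # This loop *could* run at a ferocious rate, but q.put() blocks
    # whenever the bounded queue is full - the channel applies the brakes.
    for i in range(n):
        q.put(i)
    q.put(None)            # sentinel: no more data coming

def consume(n):
    q = queue.Queue(maxsize=1)   # producer can be at most one item ahead
    threading.Thread(target=producer, args=(q, n)).start()
    results = []
    while (item := q.get()) is not None:
        time.sleep(0.01)         # stand-in for slow consumer work
        results.append(item)
    return results

print(consume(5))    # -> [0, 1, 2, 3, 4]
```

Nothing in the producer's logic knows or cares how fast the consumer runs; 
the maxsize=1 bound does all the throttling.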

I'll attach `pipe.py` to illustrate.  It constructs a 10-stage pipeline.  The 
first process in the chain *could* produce data at a ferocious rate - but the 
bounded queue connecting it to the next process slows it to the rate at which 
the second process runs.  All the other processes in the pipeline grab data 
from the preceding process via a bounded queue, work on it for a second (just a 
sleep(1) here), then enqueue the result for the next process in the pipeline.  
The main program loops, pulling data off the final process as fast as results 
show up.

So, if you run it, you'll see that new data is produced once per second, then 
when the pipeline is full final results are delivered once per second.  When 
the first process is done, results continue to be pulled off one per second 
until the pipeline is drained.

The queues here have maximum size 1, just to show that it works.  In practice, 
a larger bound is usually used, because real-life processes often take 
varying amounts of time depending on the data they're currently 
processing.  Letting a handful of work items queue up keeps processes busy - if 
processing some piece of data goes especially fast, fine, there may be more in 
the input queue waiting to go immediately.  How many is "a handful"?  Again, 
that's something that can't in general be guessed - the programmer has to apply 
their best judgment based on their deep knowledge of intended application 
behavior.

Just to be cute ;-), the code here uses imap() to start the 10 worker 
processes.  The result of calling imap() is never used; it's just an easy way 
to apply a Pool to the task.  And it absolutely relies on the fact that imap() 
consumes its iterable (range(NSTAGES)) as fast as it can, to get the 
NSTAGES=10 worker processes running.
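Since pipe.py itself is attached rather than inlined, here's a rough sketch 
of the construction described above (my own names, Manager queues so they can 
cross the Pool boundary, and a much shorter delay than the real sleep(1)):

```python
import multiprocessing as mp
import time

NSTAGES = 10
DELAY = 0.1     # pipe.py sleeps a full second per item; shortened here

def init(qs):
    global queues
    queues = qs     # hand every worker the full list of connecting queues

def stage(i):
    inq, outq = queues[i], queues[i + 1]
    while (item := inq.get()) is not None:
        time.sleep(DELAY)   # stand-in for real per-item work
        outq.put(item)
    outq.put(None)          # pass the "done" sentinel downstream

def run(nitems):
    with mp.Manager() as man:
        # NSTAGES + 1 bounded queues link producer, stages, and consumer;
        # maxsize=1 just to show the throttling works.
        qs = [man.Queue(maxsize=1) for _ in range(NSTAGES + 1)]
        with mp.Pool(NSTAGES, initializer=init, initargs=(qs,)) as pool:
            # The imap() result is never used; it just gets all NSTAGES
            # stage functions running, one per worker process.
            pool.imap(stage, range(NSTAGES))
            for i in range(nitems):   # the first "process" in the chain
                qs[0].put(i)          # blocks whenever the queue is full
            qs[0].put(None)
            results = []
            while (item := qs[-1].get()) is not None:
                results.append(item)
            return results

if __name__ == "__main__":
    print(run(4))    # -> [0, 1, 2, 3]
```

The first result only appears after the pipeline fills (NSTAGES delays), and 
results are then delivered one per delay - exactly the behavior described.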

----------
Added file: https://bugs.python.org/file49032/pipe.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue40110>
_______________________________________