Nick Guenther <[email protected]> added the comment:
Thank you for taking the time to consider my points! Yes, I think you
understood exactly what I was getting at.
I slept on it, and the day after, thinking over what I'd posted, I arrived at
most of the points you raise myself, especially that serialized next() would
mean serialized processing. So the naive approach is out.
I am motivated by trying to introduce backpressure into my pipelines. The
example you gave has potentially unbounded memory usage: if I simply slow
consumption down with sleep(), memory grows without limit and the main python
process pins a CPU, even though it "isn't" doing anything:
import itertools, multiprocessing, sys, time

def worker(x):
    return x ** 3

with multiprocessing.Pool(4) as pool:
    for i, v in enumerate(pool.imap(worker, itertools.count(1)), 1):
        time.sleep(7)
        print(f"At {i}->{v}, memory usage is {sys.getallocatedblocks()}")
At 1->1, memory usage is 230617
At 2->8, memory usage is 411053
At 3->27, memory usage is 581439
At 4->64, memory usage is 748584
At 5->125, memory usage is 909694
At 6->216, memory usage is 1074304
At 7->343, memory usage is 1238106
At 8->512, memory usage is 1389162
At 9->729, memory usage is 1537830
At 10->1000, memory usage is 1648502
At 11->1331, memory usage is 1759864
At 12->1728, memory usage is 1909807
At 13->2197, memory usage is 2005700
At 14->2744, memory usage is 2067407
At 15->3375, memory usage is 2156479
At 16->4096, memory usage is 2240936
At 17->4913, memory usage is 2328123
At 18->5832, memory usage is 2456865
At 19->6859, memory usage is 2614602
At 20->8000, memory usage is 2803736
At 21->9261, memory usage is 2999129
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11314 kousu 20 0 303308 40996 6340 S 91.4 2.1 0:21.68 python3.8
11317 kousu 20 0 54208 10264 4352 S 16.2 0.5 0:03.76 python3.8
11315 kousu 20 0 54208 10260 4352 S 15.8 0.5 0:03.74 python3.8
11316 kousu 20 0 54208 10260 4352 S 15.8 0.5 0:03.74 python3.8
11318 kousu 20 0 54208 10264 4352 S 15.5 0.5 0:03.72 python3.8
It seems to me that any use of Pool.imap() either has this same issue lurking
or runs over a small, finite data set where Pool.map() would serve just as
well. I like generators because they keep memory usage constant regardless of
stream length, which is a natural fit for stream processing. Unix pipelines
have backpressure built in, because write() blocks when the pipe buffer is
full.
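For what it's worth, backpressure can be approximated on top of the existing
imap() by gating the input iterable with a semaphore that is only released as
results are consumed. This is just a sketch; bounded_imap, maxpending, and
worker are made-up names, not anything in multiprocessing:

```python
import multiprocessing
import threading

def worker(x):
    return x ** 3  # stand-in for the real per-item work

def bounded_imap(pool, func, iterable, maxpending=8):
    # Hypothetical helper: yield results in order while keeping roughly
    # `maxpending` tasks in flight, so imap's task-submission thread
    # blocks instead of buffering the whole stream.
    sem = threading.Semaphore(maxpending)

    def gated():
        for item in iterable:
            sem.acquire()   # blocks once maxpending items are outstanding
            yield item

    for result in pool.imap(func, gated()):
        sem.release()       # consuming a result frees a slot for the producer
        yield result
```

One caveat: abandoning an infinite stream mid-iteration leaves the submission
thread blocked on the semaphore, so clean shutdown would need more care than
this shows.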
I did come up with one possibility after sleeping on it again: run the final
iteration in parallel, perhaps via a special plist() method that makes as many
parallel next() calls as possible. There are definitely details to work out,
but I plan to prototype it when I have spare time in the next couple of weeks.
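Very roughly, the shape I have in mind is something like the following (all
names are placeholders, and I'm using threads only for simplicity; a real
version would need processes, since the GIL keeps CPU-bound work serialized):

```python
import threading

def plist(iterable, func, workers=4):
    # Sketch of the idea: each worker repeatedly takes the next item
    # (next() itself serialized by a lock, since generators are not
    # thread-safe) and then runs func on it concurrently with the others.
    results = []
    lock = threading.Lock()
    it = iter(iterable)

    def drain():
        while True:
            with lock:                  # only the next() call is serialized
                try:
                    item = next(it)
                except StopIteration:
                    return
            results.append((item, func(item)))  # the work itself overlaps

    threads = [threading.Thread(target=drain) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results                      # unordered (item, result) pairs
```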
You're entirely right that it's a risky change to suggest, so maybe it would be
best expressed as a package if I get it working. Can I keep this issue open to
see if it draws in insights from anyone else in the meantime?
Thanks again for taking a look! That's really cool of you!
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue40110>
_______________________________________