Guillaume Perrault-Archambault <gperr...@uottawa.ca> added the comment:
Hi Victor and Yang,

Thanks for your fast replies.

I did initially think it could be a torch issue. Indeed, I have an equivalent numpy testcase that does not deadlock. However, the fact that it gets stuck inside a multiprocessing wait statement makes me think it's still a multiprocessing issue.

I've spent two weeks full time on this issue. Over at the torch forums I've had no replies ( https://discuss.pytorch.org/t/multiprocessing-code-works-using-numpy-but-deadlocked-using-pytorch/20473 ). On Stack Overflow I only got a workaround suggestion that works sporadically ( https://stackoverflow.com/questions/51093970/multiprocessing-code-works-using-numpy-but-deadlocked-using-pytorch ). Basically, I can get rid of the deadlock (sometimes) if I impose only one thread per process. But this is not a real solution, since it only works some of the time.

I have tried stepping through the code, but because it is multiprocessed, I cannot step through it (at least not in the conventional way, since the main thread is not doing the heavy lifting). I've tried adding print statements in the multiprocessing library and mucking around with it a bit, but debugging multi-processed code in this way is an absolute nightmare because you can't even trust the order in which print statements display on the screen. And probably more relevant, I'm out of my league here.

I'm really at a complete dead end. I'm blocked and my work cannot progress without fixing this issue. I'd be very grateful if you could try to reproduce and rule out the multiprocessing library. If you need help reproducing, I can send a different testcase that deadlocked on my friend's Mac (for him, the original testcase did not deadlock).

The testcase I attached in my original post sometimes deadlocks and sometimes doesn't, depending on the machine I run it on. So I'm not surprised you got no deadlock when you tried to reproduce. I can always get it deadlocking on Linux/Mac though, by tweaking the code.
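For reference, here is a minimal sketch of what the one-thread-per-process workaround mentioned above can look like. It is an illustration under assumptions, not the code from the attached testcase: `limit_threads` is a hypothetical helper name, and the two environment variables are the usual OpenMP/MKL ones, which may or may not be the ones that matter on a given build.

```python
import os


def limit_threads(n=1):
    """Hypothetical helper: cap the numerical thread pools of this process at n."""
    # Common OpenMP/MKL environment variables (assumed relevant here). They are
    # generally read when the numerical library initializes, so set them early.
    os.environ["OMP_NUM_THREADS"] = str(n)
    os.environ["MKL_NUM_THREADS"] = str(n)
    try:
        import torch
    except ImportError:
        return  # torch not installed; only the environment variables are set
    torch.set_num_threads(n)  # caps torch's intra-op CPU threads at runtime
```

It can be passed as `initializer=limit_threads` when creating a `multiprocessing.Pool`, or called at the top of each worker function. As noted above, this only avoids the deadlock some of the time, so it is a workaround rather than a fix.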
To give you a sense of how unreliably it deadlocks, just removing the for loop in the code (which is outside the multiprocessing portion of the code!) somehow gets rid of the deadlock. Also, it never deadlocks on Windows.

If you could provide any help on this issue I'd be very grateful.

Regards,
Guillaume

On Fri, Jul 6, 2018 at 11:21 AM STINNER Victor <rep...@bugs.python.org> wrote:
>
> STINNER Victor <vstin...@redhat.com> added the comment:
>
> IMHO it's an issue with your usage of the torch module which is not part
> of the Python stdlib, so I suggest to close this issue as "third party" or
> "not a bug".
>
> ----------
> nosy: +vstinner
>
> _______________________________________
> Python tracker <rep...@bugs.python.org>
> <https://bugs.python.org/issue34059>
> _______________________________________