w1049 commented on issue #18157:
URL: https://github.com/apache/tvm/issues/18157#issuecomment-3121137934

   > Thanks [@w1049](https://github.com/w1049) this is very interesting! Would you 
mind also creating a minimal example that uses only `PopenPoolExecutor`? Your 
suggested temp fix works, so please send a PR. We should also document this case 
in the shutdown() function.
   
   This is an example that just uses `PopenPoolExecutor`.
   ```python
   from tvm.contrib.popen_pool import PopenPoolExecutor, StatusKind
   import sys
   import gc
   
   
   def func(x):
       if x == 0:
           return x
       raise ValueError("This is a test error")
   
   
   while True:
       pool = PopenPoolExecutor()
   
       for map_result in pool.map_with_error_catching(
           lambda x: func(x),
           range(2),
       ):
           if map_result.status == StatusKind.COMPLETE:
               print(f"Completed with {map_result.value}")
           elif map_result.status == StatusKind.EXCEPTION:
               print(f"Exception raised: {map_result.value}")
           else:
               print(f"Unexpected status: {map_result.status}")
   
       print("Finished, trying to delete pool...")
       print("Ref count:", sys.getrefcount(pool))
       print("Referrers:", gc.get_referrers(pool))
   
       del pool  # decrement the reference count
       print("After `del pool`")
   ```
   It demonstrates the following scenario from the Python documentation: when an 
exception occurs in a worker function, `del` cannot immediately delete the pool.
   > CPython implementation detail: It is possible for a reference cycle to 
prevent the reference count of an object from going to zero. In this case, the 
cycle will be later detected and deleted by the [cyclic garbage 
collector](https://docs.python.org/3/glossary.html#term-garbage-collection). A 
common cause of reference cycles is when an exception has been caught in a 
local variable. The frame’s locals then reference the exception, which 
references its own traceback, which references the locals of all frames caught 
in the traceback.
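
   The mechanism in the quoted passage can be reproduced without TVM. The sketch below is only an illustration of the general CPython behavior, not the code path inside `PopenPoolExecutor`; `Tracked` and `fail_and_keep_exception` are hypothetical names, and the automatic collector is disabled so that the timing is deterministic.

```python
import gc


class Tracked:
    """Hypothetical stand-in for an object with a finalizer, such as a pool."""

    finalized = False

    def __del__(self):
        Tracked.finalized = True


def fail_and_keep_exception():
    obj = Tracked()  # lives in this frame's locals
    try:
        raise ValueError("This is a test error")
    except ValueError as exc:
        # Keeping the exception in a local closes the cycle:
        # frame locals -> kept -> traceback -> frame -> locals (incl. obj)
        kept = exc


gc.disable()  # keep the automatic collector out of the picture
try:
    fail_and_keep_exception()
    before = Tracked.finalized  # the frame is gone, but the cycle keeps obj alive
    gc.collect()                # the cyclic collector breaks the cycle
    after = Tracked.finalized
finally:
    gc.enable()

print("finalized before gc.collect():", before)
print("finalized after gc.collect():", after)
```

   The point is that the finalizer runs whenever the cyclic collector happens to run, not at the point where the last visible name goes away.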
   
   Thus the conditions for this deadlock are:
   - an exception occurs during a build, typically due to an invalid config
   - in the next build, when a newly created pool is in the function 
`_maintain_shutdown_locks()`, GC happens to clean up the previous pool and 
invokes `shutdown()`.
   
   I will send a PR soon.
   

