zhengruifeng opened a new pull request, #40793:
URL: https://github.com/apache/spark/pull/40793

   ### What changes were proposed in this pull request?
   `TorchDistributorLocalUnitTestsOnConnect` and 
`TorchDistributorLocalUnitTestsIIOnConnect` were unstable and occasionally 
got stuck. I could not reproduce the issue locally, so they were 
disabled.
   
   This PR re-enables them. The old Torch tests created the Connect session 
in `setUp` and closed it in `tearDown`. Such session operations are 
expensive, so the tests now use `setUpClass` and `tearDownClass` instead. 
After this change, the related tests seem much more stable. I therefore think 
the root cause is resource-related: TorchDistributor runs in barrier mode, so 
when GitHub Actions does not have enough resources, the tests just keep 
waiting.
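
   The difference between per-test and per-class fixtures can be sketched as 
follows. This is only an illustration, not the code in this PR: `FakeSession`, 
`SharedSessionTests`, and `run_shared_session_tests` are hypothetical names, 
and `FakeSession` stands in for an expensive Connect session so the example 
runs without Spark.

```python
import io
import unittest


class FakeSession:
    """Hypothetical stand-in for an expensive Spark Connect session."""

    created = 0  # counts how many sessions were ever built

    def __init__(self):
        FakeSession.created += 1
        self.stopped = False

    def stop(self):
        self.stopped = True


class SharedSessionTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once for the whole class, not once per test method.
        cls.session = FakeSession()

    @classmethod
    def tearDownClass(cls):
        # Runs once after the last test method in the class.
        cls.session.stop()

    def test_first(self):
        self.assertFalse(self.session.stopped)

    def test_second(self):
        self.assertFalse(self.session.stopped)


def run_shared_session_tests():
    """Run the suite and return (unittest result, sessions created)."""
    FakeSession.created = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedSessionTests)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite), FakeSession.created
```

   With `setUp`/`tearDown`, the same two tests would build and stop two 
sessions; with the class-level hooks, only one session is created and shared.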
   
   
   ### Why are the changes needed?
   To restore test coverage.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No, test-only. It re-enables `TorchDistributorLocalUnitTestsOnConnect` and 
`TorchDistributorLocalUnitTestsIIOnConnect`.
   
   ### How was this patch tested?
   CI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

