YutingZhang opened a new pull request #13318: Improving multi-processing 
reliability for gluon DataLoader
URL: https://github.com/apache/incubator-mxnet/pull/13318
 
 
   I found some multiprocessing-related issues in the Gluon DataLoader.
   
    1) Each time a _MultiWorkerIter shuts down, it can leave dangling worker 
processes behind, because the shutdown mechanism does not guarantee that every 
worker process is terminated. As a result, dangling processes accumulate as 
more epochs are run.
   
     This problem rarely happens during training, where there is a decent time 
interval between prefetching the last batch and the _MultiWorkerIter shutting 
down.
     But the problem happens frequently (a) when I stop the iter before the end 
of an epoch, and (b) when I use the DataLoader for a data loading service and 
load the data as fast as possible. In both cases, the time interval between the 
most recent data prefetching and the iter shutdown is short. I guess that the 
_MultiWorkerIter is unable to shut down properly during active data 
prefetching.
   
     To fix this, I explicitly terminate the worker processes inside the 
shutdown function.
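
A minimal sketch of this kind of shutdown, not the actual patch. The
`WorkerPool` class, its queue, and `_worker_loop` are hypothetical stand-ins
for the iterator and its workers. It first asks each worker to exit with a
`None` sentinel, waits briefly, and then calls `terminate()` on any process
that is still alive. The `fork` start method (POSIX only) is assumed to keep
the example self-contained.

```python
import multiprocessing as mp
import time

# Assumption: the "fork" start method (POSIX only) keeps this sketch simple.
_ctx = mp.get_context("fork")


def _worker_loop(key_queue):
    """Hypothetical worker: each job is a number of seconds of simulated work."""
    while True:
        job = key_queue.get()
        if job is None:  # sentinel: exit cleanly
            break
        time.sleep(job)


class WorkerPool:
    """Illustrative stand-in for an iterator that owns worker processes."""

    def __init__(self, num_workers=2):
        self._key_queue = _ctx.Queue()
        self._closed = False
        self._workers = []
        for _ in range(num_workers):
            worker = _ctx.Process(target=_worker_loop, args=(self._key_queue,))
            worker.daemon = True
            worker.start()
            self._workers.append(worker)

    def submit(self, seconds):
        self._key_queue.put(seconds)

    def shutdown(self, timeout=0.5):
        if self._closed:
            return
        self._closed = True
        # Step 1: ask every worker to stop on its own.
        for _ in self._workers:
            self._key_queue.put(None)
        # Step 2: wait briefly, then forcibly terminate whatever is still alive.
        for worker in self._workers:
            worker.join(timeout)
            if worker.is_alive():
                # terminate() may corrupt a queue the worker was using,
                # so the queue should not be reused afterwards.
                worker.terminate()
                worker.join()

    def alive_workers(self):
        return sum(worker.is_alive() for worker in self._workers)
```

With busy workers, `shutdown()` waits at most about `timeout` seconds per
worker rather than waiting for the in-flight jobs to finish.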
   
     2) When loading data fast (still mostly during testing and data serving), 
there seems to be a risk of a data race. The _MultiWorkerIter uses a dict to 
cache prefetched data, but the dict does not seem to be thread-safe for 
concurrent insertion and deletion of elements. So occasionally, data can be 
missing from the dict.
   
     To prevent this, I use a scope lock to guard the dict access.
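
A minimal sketch of the locking idea, assuming the cache maps a batch index to
a prefetched batch. `LockedBuffer` and its method names are hypothetical and
are not taken from the DataLoader source. Every access goes through one
`threading.Lock` held in a `with` block, so an insert and a remove can never
interleave.

```python
import threading


class LockedBuffer:
    """Hypothetical cache (batch index -> batch) guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, idx, batch):
        # The lock is released automatically when the with-block exits.
        with self._lock:
            self._data[idx] = batch

    def pop(self, idx, default=None):
        # Lookup and removal happen as one step under the lock.
        with self._lock:
            return self._data.pop(idx, default)

    def __len__(self):
        with self._lock:
            return len(self._data)
```

A fetcher thread would call `put(idx, batch)` as batches arrive, while the
consumer calls `pop(idx)` and treats the default value as "not ready yet".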
   
   ## Description ##
   (Brief description on what this PR is about)
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [ ] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to 
the relevant [JIRA issue](https://issues.apache.org/jira/projects/MXNET/issues) 
created (except PRs with tiny changes)
   - [ ] Changes are complete (i.e. I finished coding on this PR)
   - [ ] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding 
a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing 
distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a 
new build option with NCCL)
   - [ ] Code is well-documented: 
   - For user-facing API changes, API doc string has been updated. 
   - For new C++ functions in header files, their functionalities and arguments 
are documented. 
   - For new examples, README.md is added to explain what the example does, 
the source of the dataset, expected performance on test set and reference to 
the original paper if applicable
   - Check the API doc at 
http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
   - [ ] To the best of my knowledge, examples are either not affected by this 
change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [ ] Feature1, tests, (and when applicable, API doc)
   - [ ] Feature2, tests, (and when applicable, API doc)
   
   ## Comments ##
   - If this change is a backward incompatible change, why must this change be 
made.
   - Interesting edge cases to note here
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
