zhreshold opened a new issue #17269: [mxnet 2.0][item 4.8][RFC] Gluon Data API Extension and Fixes (Part 2)
URL: https://github.com/apache/incubator-mxnet/issues/17269
 
 
   ## Description
   This is part 2 of the Gluon Data API extension and fixes, which mainly focuses on speeding up the current data loading pipeline built on the Gluon dataset and dataloader.
   
   ## Motivation
   
   The current data loading pipeline is the major bottleneck for many training 
tasks. We can summarize the entire flow as:
   
   ```bash
   | Dataset.__getitem__ ->
   | Transform.__call__()/forward() ->
   | Batchify ->
   | (optionally communicate through shared_mem) ->
   | split_and_load(ctxs) ->
   | <training on GPUs>
   ```
   where there are performance concerns:
   - the performance of Python dataset/transform functions isn't satisfying
   - it's not easy to embrace multithreading to speed up data loading because of the global interpreter lock
   - Python multiprocessing is unfortunately slow and error-prone, not to mention that shared memory implementations differ considerably across operating systems and are very annoying (e.g., it's easy to run out of shared memory if not handled carefully)
   - there is currently no memory planning for batchify, causing frequent allocation/deallocation of large chunks of memory when the batch size is big
   - the batchify-then-split-and-load sequence can be fused into a single `partial_batchify` step (illustrated in the sketch below)
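   
   For concreteness, here is a minimal sketch of how those stages surface in today's user-facing API (existing mxnet 1.x calls; the last two steps are the ones a fused `partial_batchify` would combine):
   ```python
   import mxnet as mx
   from mxnet import gluon
   
   # Dataset.__getitem__ + Transform.__call__
   dataset = gluon.data.vision.MNIST(train=True).transform_first(
       gluon.data.vision.transforms.ToTensor())
   
   # Batchify (+ communication through shared_mem when num_workers > 0)
   loader = gluon.data.DataLoader(dataset, batch_size=64, num_workers=2)
   
   ctxs = [mx.cpu(0)]  # or a list of mx.gpu(i)
   for data, label in loader:
       # split_and_load(ctxs): a full batch is built first, then split
       data_list = gluon.utils.split_and_load(data, ctxs)
       label_list = gluon.utils.split_and_load(label, ctxs)
       # <training on GPUs>
   ```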
   
   ## Proposal
   To alleviate these issues, I propose a hybrid solution, that is, to:
   - provide C++ datasets that cover the most common use cases
        ```python
        from gluon.data.dataset import TupleDataset, ImageSequenceDataset, ArrayDataset

        # as long as TupleDataset, ImageSequenceDataset, and ArrayDataset are
        # supported by the backend, the composed dataset is fully C++ backed
        dataset = TupleDataset([ImageSequenceDataset(img_paths), ArrayDataset(image_labels)])
        # dataset is an image classification dataset fully supported in C++;
        # with TupleDataset we can combine as many data sources as needed

        # a C++-backed dataset can have a magic __handle__ method that returns
        # the C++ handle for reference
        class TupleDataset:
            def __init__(self, datasets):
                if all(callable(getattr(d, '__handle__', None)) for d in datasets):
                    # all constituent datasets are supported by the backend
                    self._tuple_dataset = check_call(
                        _LIB.MXTupleDatasetCreate([d.__handle__() for d in datasets]))
                else:
                    # at least one pure-Python dataset; no C++ handle available
                    self._tuple_dataset = None

            def __handle__(self):
                return self._tuple_dataset
        ```
    - provide common C++ batchify functions that are split- and context-aware (a reference sketch of the intended semantics follows this list); batchify with a memory planner is TBD
    - provide a C++ `MultithreadingDataLoader` which inherits the same arguments as `gluon.data.DataLoader` but uses MXNet's internal multithreading rather than Python multiprocessing
    - fall back to Python multiprocessing whenever
        - the dataset is not fully supported by the backend (e.g., there are custom Python datasets)
        - the transform is not fully hybridizable
        - the batchify function is not fully supported by the backend
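    
    To make "split- and context-aware" concrete, the following is a pure-Python reference of the intended `partial_batchify` semantics; the name and signature here are assumptions for illustration, not the proposed C++ API:
    ```python
    import numpy as np

    def partial_batchify(samples, num_slices):
        """Stack samples directly into per-context sub-batches instead of
        building one big batch and splitting it afterwards (illustrative
        NumPy reference of the semantics only)."""
        assert len(samples) % num_slices == 0, "even split assumed for brevity"
        per_slice = len(samples) // num_slices
        return [np.stack(samples[i * per_slice:(i + 1) * per_slice])
                for i in range(num_slices)]

    # e.g., 8 samples across 2 contexts -> two sub-batches of shape (4, 3, 4)
    sub_batches = partial_batchify([np.zeros((3, 4))] * 8, num_slices=2)
    ```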
   
    Users will continue to use the existing `gluon.data.DataLoader`, and the conversion will be applied automatically:
    ```python
    loader = gluon.data.DataLoader(hybrid_dataset.transform(hybrid_transform),
                                   batch_size=32, batchify_fn=hybrid_batchify)

    class DataLoader:
        def __init__(self, dataset, batchify_fn=None, ...):
            # is_hybrid() is a backend-support check, sketched below
            if (isinstance(dataset, _LazyTransformDataset)
                    and is_hybrid(dataset._transform)
                    and is_hybrid(dataset)
                    and is_hybrid(batchify_fn)):
                self._mt_dataloader = check_call(_LIB.MXMultiThreadDataLoaderCreate(...))
            else:
                self._mt_dataloader = None

        def __iter__(self):
            if self._mt_dataloader:
                return self._mt_dataloader
            # fallback to the single-threaded or multiprocessing dataloader
            ...
    ```
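    
    The `is_hybrid` check above is left undefined; one plausible implementation, reusing the `__handle__` convention from the dataset example (a hypothetical helper, not an existing API), would be:
    ```python
    def is_hybrid(obj):
        # an object is "hybrid" (fully backend-supported) if it exposes a
        # callable __handle__ that yields a valid C++ handle -- hypothetical
        # convention mirroring the TupleDataset example above
        handle_fn = getattr(obj, '__handle__', None)
        return callable(handle_fn) and handle_fn() is not None
    ```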
   
    With this change, mxnet 2.0 will get a smooth transition to mixed data loaders. Please comment with specific examples that this proposal fails to accommodate.
