[GitHub] [incubator-mxnet] Neutron3529 commented on pull request #19748: [v1.x] provide a faster PrefetchedDataLoader

GitBox Sat, 16 Jan 2021 18:03:15 -0800


Neutron3529 commented on pull request #19748:
URL: https://github.com/apache/incubator-mxnet/pull/19748#issuecomment-761715294



   > > The reason why using Dataloader with auto_reload is:
   > > MXNet 2.0's DataLoader with the default nopython mode prefetch data by 
default.
   > 
   > MXNet 2 uses version number 2 because it breaks APIs. MXNet uses 
https://semver.org/ and we must not introduce backward incompatible changes in 
the v1.x branch. (Changing defaults with major impact is backwards 
incompatible). It's fine to add new features in v1.x.
   
   Most of the behavior is unchanged, since the change only prefetches data 
and does not modify it.
   
   > > There is only one iter for a DataLoader in most of the cases.(Thus only 
one prefetched iter is generated.)
   > > if we call iter explicitly, we should call it twice (one right after the 
define of the DataLoader, and another one after the previous iter is consumed).
   > 
   > So what's the problem here? Currently I'm not convinced your code / 
documentation is correct. For example:
   > 
   > ```
   >     >>> train_iter = DataLoader(train_data.transform_first(transform_train),
   >     ...                         batch_size=1,num_workers=1)
   >     (pre)fetching data here
   >     >>> it = iter(train_iter) # nothing is generated since lazy-evaluation occurs
   >     >>> it2 = iter(train_iter)
   >     >>> it3 = iter(train_iter)
   >     >>> it4 = iter(train_iter)
   >     >>> _ = next(it2) # the first iter we are using is the prefetched iter.
   >     >>> _ = next(it) # since the prefetched iter is consumed, we have to fetch data for `it`.
   > ```
   > 
   > However, looking at your implementation, actually 4 prefetched iters are 
created and the comments in the last two lines are wrong. Please correct me if 
you disagree.
   
   Because of lazy evaluation, an iter does not call 
`self.refresh`/`self.clean` until its first `__next__()` call. So there are 
4 iters, but only the first one we actually use (`it2` here) is the 
prefetched iter.
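   
   To illustrate the point, here is a minimal sketch in plain Python. 
`LazyPrefetchLoader` and its `log` attribute are hypothetical names used 
only for this illustration, not the PR's implementation. It assumes the 
loader prefetches once at construction and hands that data to whichever 
iter first calls `__next__()`:

```python
class LazyPrefetchLoader:
    """Hypothetical stand-in for the loader under discussion; not MXNet code."""

    def __init__(self, data):
        self._data = list(data)
        self.log = []                      # records what happened, in order
        self._prefetched = self._fetch()   # eager prefetch at construction time

    def _fetch(self):
        self.log.append("fetch")
        return iter(self._data)

    def __iter__(self):
        # Generator function: none of this body runs until the first next().
        if self._prefetched is not None:
            source, self._prefetched = self._prefetched, None
            self.log.append("reuse prefetched")
        else:
            source = self._fetch()
        yield from source


loader = LazyPrefetchLoader(range(3))
it = iter(loader)    # creates a generator; no fetch happens yet
it2 = iter(loader)
first = next(it2)    # it2 runs first, so it takes the prefetched data
second = next(it)    # the prefetched data is gone, so `it` fetches again
```

   In this sketch, creating `it` and `it2` triggers no work; only the calls 
to `next()` do, and the order of those calls decides which iter receives the 
prefetched data.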
   
   What's more, for a regular training procedure:
   ```python
       >>> from mxnet.gluon.data import ArrayDataset, DataLoader
       >>> train_data = ArrayDataset([i for i in range(10)],[9-i for i in range(10)])
       >>> def transform_train(sample):
       ...   if sample == 0 : print('(pre)fetching data here')
       ...   return sample
       ...
       >>> train_iter = DataLoader(train_data.transform_first(transform_train),
       ...                         auto_reload=False, batch_size=1,num_workers=1)
       >>> test_data = ArrayDataset([i for i in range(10)],[9-i for i in range(10)])
       >>> test_iter = DataLoader(test_data, batch_size=1,num_workers=1)
       >>> for epoch in range(200):
       ...   # there is almost no difference between it and the default DataLoader
       ...   for data, label in train_iter:
       ...     pass  # training...
       ...   for data, label in test_iter:
       ...     pass  # testing...
   ```
   There is only one iter per DataLoader at any given time. Most of the time, 
users do not think about what happens inside the DataLoader.
   
   > 
   > > (maybe we should not using with ag.record(): since "Explicit is better 
than implicit." (Zen of Python))
   > 
   > What's the relation to the current discussion?
   
   With `ag.record()`, we `implicitly` modify state that helps calculate the 
gradient of the network.
   My only point is that it is fine for us to use some `implicit` operations 
to simplify the execution of the program.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
