asitstands opened a new pull request #9991: Random shuffle implementation
URL: https://github.com/apache/incubator-mxnet/pull/9991
 
 
   ## Description ##
   
   This operator randomly shuffles the elements of an NDArray along the last 
axis, i.e., it performs a batched shuffling. For example, if you shuffle a 
matrix (2D array), the elements in each row are shuffled within the row and the 
shufflings of different rows are statistically independent. Thus, for each 
element, all indices except the last one are preserved, while the last one 
changes randomly.
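   
   To make the semantics concrete, here is a minimal numpy sketch of the 
batched shuffle described above (the function name is illustrative, not the 
proposed operator's API):
   
```python
import numpy as np

def batched_shuffle_ref(a, rng=np.random):
    """Shuffle a copy of `a` independently along its last axis."""
    out = a.copy()
    rows = out.reshape(-1, a.shape[-1])  # collapse all leading axes
    for row in rows:
        rng.shuffle(row)                 # in-place shuffle of one batch
    return out

x = np.arange(6).reshape(2, 3)
print(batched_shuffle_ref(x))
# e.g. [[2 0 1]
#       [4 3 5]] -- each row is an independent permutation of itself
```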
   
   ### Motivation
   
   1. Random shuffle is a widely used operation, and most random number APIs, 
such as those of numpy or the C++ standard library, provide it, while mxnet 
lacks it. It would be especially useful for shuffling arrays on the GPU.
   2. If this PR is accepted, I'd like to open a separate PR which removes the 
dependency of the mxnet python API on the *global* random number generators of 
python and numpy. I think that this dependency is unsound because it introduces 
an unnecessary coupling between mxnet and other parts of a user's environment, 
which can be a source of subtle technical or mathematical bugs. Personally I 
have wasted a fair amount of time tracking down unexpected randomness caused by 
this coupling, and it has confused many newcomers; see, for example, #9978, 
#9335, #7124 and #736. To replace random number generation through the global 
generators of python and numpy with generation through the operators of 
`mx.nd.random`, we need to copy data back and forth between a numpy array (or a 
python list) and an NDArray, but I think the overhead of this copying is not 
significant, and avoiding the coupling is much more important. When I actually 
attempted the replacement, the only obstacle was that mxnet does not support 
random shuffling, so I implemented it. However, I think the shuffling is useful 
regardless of this motivation, which is why I opened it as a separate PR. A 
minimal sketch of the intended replacement pattern follows this list.
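   
   As a hedged illustration of the decoupling described in point 2 (the 
snippet uses only existing `mx.nd.random` operators; it is a sketch of the 
pattern, not part of this PR):
   
```python
import mxnet as mx
import numpy as np

# Before: draws from numpy's *global* generator, so mxnet's randomness
# is entangled with any other code that seeds or consumes np.random.
noise = np.random.uniform(0, 1, size=(2, 3))

# After: draws from mxnet's own generator via an mx.nd.random operator;
# copy to numpy explicitly only where a numpy array is actually needed.
noise = mx.nd.random.uniform(0, 1, shape=(2, 3)).asnumpy()
```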
   
   ### Implementation
   
   1. On CPU, the shuffling is delegated to `__gnu_parallel::random_shuffle` 
for gcc and to `std::shuffle` for other compilers. The shufflings in a batch 
are processed sequentially, one by one.
   2. On GPU, a random key of `double` type is generated for each element of 
the array and then the elements are sorted by the keys. The keys for the `n`-th 
batch are distributed in the range `[n, n + 1]`, so the entire array can be 
sorted by the keys at once. The sorting is delegated to mxnet's `SortByKey`, 
which in turn delegates the call to thrust's `sort_by_key`. If both the batch 
size and the length of the last axis are very large, the quality of the 
shuffling could degrade in this implementation, i.e., the outcomes may not be 
uniformly distributed. This is because, as `n` grows, the number of floating 
point numbers in `[n, n + 1]` shrinks, so the probability of duplicated keys 
grows. However, I believe this hardly causes a problem in practice. If it is 
not acceptable, an alternative implementation is to process the shufflings in 
a batch sequentially, but this alternative shows very poor performance 
compared to the current implementation even when the batch size is only some 
hundreds. A numpy mirror of the key-sort idea follows this list.
   3. In-place shuffling is allowed.
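   
   For reference, here is a minimal numpy mirror of the key-sort approach 
from item 2 (a sketch of the idea, not the CUDA code; `np.argsort` stands in 
for thrust's `sort_by_key`):
   
```python
import numpy as np

def keysort_shuffle_ref(a, rng=np.random):
    """Shuffle along the last axis by sorting batch-offset random keys."""
    flat = a.reshape(-1, a.shape[-1])
    n_batches, length = flat.shape
    # One key per element; adding the batch index keeps batch n's keys
    # inside [n, n + 1], so a single global sort never mixes batches.
    keys = rng.random_sample(flat.shape) + np.arange(n_batches)[:, None]
    order = np.argsort(keys.ravel())      # stands in for sort_by_key
    return flat.ravel()[order].reshape(a.shape)

x = np.arange(6).reshape(2, 3)
print(keysort_shuffle_ref(x))

# The caveat from item 2: adjacent doubles near n are np.spacing(n) apart,
# so [n, n + 1] holds roughly 1 / np.spacing(n) distinct keys -- about
# 2**52 for n = 1 but only about 2**32 for n around 2**20.
print(1.0 / np.spacing(1.0), 1.0 / np.spacing(2.0**20))
```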
   
   ## Checklist ##
   ### Essentials ###
   - [x] Passed code style checking (`make lint`)
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage:
   - Unit tests are added for small changes to verify correctness (e.g. adding 
a new operator)
   - Nightly tests are added for complicated/long-running ones (e.g. changing 
distributed kvstore)
   - Build tests will be added for build configuration changes (e.g. adding a 
new build option with NCCL)
   - [ ] Code is well-documented: 
   - For user-facing API changes, API doc string has been updated. 
   - For new C++ functions in header files, their functionalities and arguments 
are documented. 
   - For new examples, README.md is added to explain what the example does, 
the source of the dataset, the expected performance on the test set, and a 
reference to the original paper if applicable
   - [x] To the best of my knowledge, examples are either not affected by this 
change, or have been fixed to be compatible with this change
   
