ptrendx opened a new pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767

## Description ##

For operators whose performance is limited by global memory bandwidth, it is important to issue the widest possible loads and stores, as this ensures that the bandwidth is fully utilized. Currently, MXNet uses vectorized loads and stores only for the `half_t` type and only in a few operators (some elementwise binary operators and `elementwise_sum`). Unfortunately, the way this was done makes assumptions about MXNet's NDArrays that do not hold in all cases.

- Failure 1:
  ```python
  import mxnet as mx
  ctx = mx.gpu()
  a = mx.nd.array([1, 2, 3, 4], dtype='float16', ctx=ctx)
  b = mx.nd.array([1, 2, 3, 4], dtype='float16', ctx=ctx)
  c = a[1:3]
  d = b[1:3]
  mx.nd.elemwise_add(c, d, out=c)
  ```
  results in the error:
  ```
  Check failed: e == cudaSuccess: CUDA: misaligned address
  ```
- Failure 2:
  ```python
  import mxnet as mx
  ctx = mx.gpu()
  a = mx.nd.array([1, 2, 3, 4], dtype='float16', ctx=ctx)
  b = mx.nd.array([1, 2, 3, 4], dtype='float16', ctx=ctx)
  print(a)
  c = a[0:3]
  d = b[0:3]
  mx.nd.elemwise_add(c, d, out=c)
  mx.nd.waitall()
  print(c)
  print(a)
  ```
  gives:
  ```
  [1. 2. 3. 4.]
  <NDArray 4 @gpu(0)>

  [2. 4. 6.]
  <NDArray 3 @gpu(0)>

  [2. 4. 6. 8.]
  <NDArray 4 @gpu(0)>
  ```
  which is silent data corruption (the last element of `a` should not have been changed).

These bugs were not noticed before because `a + b` on NDArrays dispatches to `broadcast_add` instead of `elemwise_add` (and is not vectorized), while in symbolic execution slices produce new allocations, which do not exhibit these issues.
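Both failures above share a root cause: issuing a full vector-width access on a slice whose pointer is not vector-aligned (Failure 1) or whose length is not a multiple of the vector width, so the wide store touches a neighbouring element (Failure 2). Below is a minimal host-side C++ sketch of the general remedy — scalar accesses for the misaligned head and the partial tail, wide accesses for the aligned body. The function `vectorized_copy` is purely illustrative and is not MXNet's actual helper:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Copy n 2-byte elements (stand-ins for half_t) from src to dst using the
// widest accesses possible: scalar head until dst is 8-byte aligned, one
// 8-byte access per 4 elements in the body, scalar tail for the remainder.
// Illustrative sketch only; not MXNet's actual implementation.
void vectorized_copy(std::uint16_t* dst, const std::uint16_t* src, std::size_t n) {
  std::size_t i = 0;
  // Wide accesses are only possible when src and dst share the same
  // alignment phase; otherwise fall through to the fully scalar loop below.
  const bool same_phase =
      (reinterpret_cast<std::uintptr_t>(dst) % 8) ==
      (reinterpret_cast<std::uintptr_t>(src) % 8);
  if (same_phase) {
    // Scalar head: advance element by element until dst is 8-byte aligned,
    // avoiding the misaligned access from Failure 1.
    while (i < n && (reinterpret_cast<std::uintptr_t>(dst + i) % 8) != 0) {
      dst[i] = src[i];
      ++i;
    }
    // Vectorized body: 4 elements per 8-byte access (memcpy stands in for a
    // single wide load/store that a GPU kernel would issue).
    for (; i + 4 <= n; i += 4) {
      std::uint64_t v;
      std::memcpy(&v, src + i, 8);
      std::memcpy(dst + i, &v, 8);
    }
  }
  // Scalar tail: never touch memory past element n-1, which is exactly the
  // over-write that silently corrupted the neighbouring element in Failure 2.
  for (; i < n; ++i) dst[i] = src[i];
}
```

Copying a slice that starts at an odd element offset (analogous to `a[1:3]`) exercises both the scalar head and the scalar tail while leaving the elements outside the slice untouched.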
This PR:
- fixes those issues
- introduces helpers for handling vectorization (for all types, not only `half_t`)
- increases the performance of vectorized kernels
- introduces vectorization for all binary/unary/binary-with-scalar ops
- (WIP) introduces vectorization for broadcast ops

@eric-haibin-lin @sxjscience @haojin2

## Checklist ##

### Essentials ###

Please feel free to remove inapplicable items for your PR.

- [ ] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage:
  - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
- [ ] Code is well-documented:
  - For new C++ functions in header files, their functionalities and arguments are documented.
- [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

### Changes ###

- [x] Properly handle vectorized loads and stores in elementwise kernels
- [ ] Handle vectorized loads and stores in elementwise broadcast kernels
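Besides fixing the kernels themselves, vectorization helpers of this kind typically need a dispatch-time check that decides whether a vectorized kernel may be launched for a given set of operand pointers at all. A hedged sketch of such a check follows; the name `can_vectorize` and its three-pointer signature are illustrative assumptions, not MXNet's actual API:

```cpp
#include <cstdint>

// Returns true only if every operand pointer is aligned to the full vector
// width (nvec elements of type T), so the kernel may use wide accesses for
// the entire array. A production version would additionally accept pointers
// that share the same misalignment phase and cover the head/tail with scalar
// accesses. Illustrative sketch; not MXNet's actual helper.
template <typename T, int nvec>
bool can_vectorize(const T* out, const T* lhs, const T* rhs) {
  constexpr std::uintptr_t align = sizeof(T) * nvec;
  auto aligned = [](const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % align == 0;
  };
  return aligned(out) && aligned(lhs) && aligned(rhs);
}
```

With a check like this, a slice taken at an odd element offset (as in Failure 1) simply falls back to the scalar kernel instead of crashing with a misaligned-address error.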
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
