ptrendx opened a new pull request #17767: [WIP] Fix and optimize handling of vectorized memory accesses
URL: https://github.com/apache/incubator-mxnet/pull/17767

## Description ##

For operators whose performance is limited by global memory bandwidth, it is important to issue the widest possible loads and stores, as this ensures that the bandwidth is fully utilized. Currently, MXNet uses vectorized loads and stores only for the `half_t` type and only in a few operators (some elementwise binary operators and `elementwise_sum`). Unfortunately, the way this was done makes assumptions about MXNet's NDArrays that do not hold in all cases.

- Failure 1:
  ```python
  import mxnet as mx
  ctx = mx.gpu()
  a = mx.nd.array([1, 2, 3, 4], dtype='float16', ctx=ctx)
  b = mx.nd.array([1, 2, 3, 4], dtype='float16', ctx=ctx)
  c = a[1:3]
  d = b[1:3]
  mx.nd.elemwise_add(c, d, out=c)
  ```
  results in the error:
  ```
  Check failed: e == cudaSuccess: CUDA: misaligned address
  ```
- Failure 2:
  ```python
  import mxnet as mx
  ctx = mx.gpu()
  a = mx.nd.array([1, 2, 3, 4], dtype='float16', ctx=ctx)
  b = mx.nd.array([1, 2, 3, 4], dtype='float16', ctx=ctx)
  print(a)
  c = a[0:3]
  d = b[0:3]
  mx.nd.elemwise_add(c, d, out=c)
  mx.nd.waitall()
  print(c)
  print(a)
  ```
  gives:
  ```
  [1. 2. 3. 4.]
  <NDArray 4 @gpu(0)>

  [2. 4. 6.]
  <NDArray 3 @gpu(0)>

  [2. 4. 6. 8.]
  <NDArray 4 @gpu(0)>
  ```
  which is silent data corruption (the last element of `a` should not have been changed).

These bugs were not noticed before because `a + b` on NDArrays dispatches to `broadcast_add` instead of `elemwise_add` (and is not vectorized), while in symbolic execution slices produce new allocations, which do not exhibit these issues.
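Both failures above share a root cause: issuing a full vector-width access on a slice whose pointer is not vector-aligned (Failure 1) or whose length is not a multiple of the vector width, so the wide store touches a neighbouring element (Failure 2). Below is a minimal host-side C++ sketch of the general remedy — scalar accesses for the misaligned head and the partial tail, wide accesses for the aligned body. The function `vectorized_copy` is purely illustrative and is not MXNet's actual helper:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Copy n 2-byte elements (stand-ins for half_t) from src to dst using the
// widest accesses possible: scalar head until dst is 8-byte aligned, one
// 8-byte access per 4 elements in the body, scalar tail for the remainder.
// Illustrative sketch only; not MXNet's actual implementation.
void vectorized_copy(std::uint16_t* dst, const std::uint16_t* src, std::size_t n) {
  std::size_t i = 0;
  // Wide accesses are only possible when src and dst share the same
  // alignment phase; otherwise fall through to the fully scalar loop below.
  const bool same_phase =
      (reinterpret_cast<std::uintptr_t>(dst) % 8) ==
      (reinterpret_cast<std::uintptr_t>(src) % 8);
  if (same_phase) {
    // Scalar head: advance element by element until dst is 8-byte aligned,
    // avoiding the misaligned access from Failure 1.
    while (i < n && (reinterpret_cast<std::uintptr_t>(dst + i) % 8) != 0) {
      dst[i] = src[i];
      ++i;
    }
    // Vectorized body: 4 elements per 8-byte access (memcpy stands in for a
    // single wide load/store that a GPU kernel would issue).
    for (; i + 4 <= n; i += 4) {
      std::uint64_t v;
      std::memcpy(&v, src + i, 8);
      std::memcpy(dst + i, &v, 8);
    }
  }
  // Scalar tail: never touch memory past element n-1, which is exactly the
  // over-write that silently corrupted the neighbouring element in Failure 2.
  for (; i < n; ++i) dst[i] = src[i];
}
```

Copying a slice that starts at an odd element offset (analogous to `a[1:3]`) exercises both the scalar head and the scalar tail while leaving the elements outside the slice untouched.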
This PR:
- fixes those issues
- introduces helpers for handling vectorization (for all types, not only `half_t`)
- increases the performance of vectorized kernels
- introduces vectorization for all binary/unary/binary-with-scalar ops
- (WIP) introduces vectorization for broadcast ops

@eric-haibin-lin @sxjscience @haojin2

## Checklist ##

### Essentials ###

Please feel free to remove inapplicable items for your PR.

- [ ] Changes are complete (i.e. I finished coding on this PR)
- [ ] All changes have test coverage:
  - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
- [ ] Code is well-documented:
  - For new C++ functions in header files, their functionalities and arguments are documented.
- [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

### Changes ###

- [x] Properly handle vectorized loads and stores in elementwise kernels
- [ ] Handle vectorized loads and stores in elementwise broadcast kernels
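Besides fixing the kernels themselves, vectorization helpers of this kind typically need a dispatch-time check that decides whether a vectorized kernel may be launched for a given set of operand pointers at all. A hedged sketch of such a check follows; the name `can_vectorize` and its three-pointer signature are illustrative assumptions, not MXNet's actual API:

```cpp
#include <cstdint>

// Returns true only if every operand pointer is aligned to the full vector
// width (nvec elements of type T), so the kernel may use wide accesses for
// the entire array. A production version would additionally accept pointers
// that share the same misalignment phase and cover the head/tail with scalar
// accesses. Illustrative sketch; not MXNet's actual helper.
template <typename T, int nvec>
bool can_vectorize(const T* out, const T* lhs, const T* rhs) {
  constexpr std::uintptr_t align = sizeof(T) * nvec;
  auto aligned = [](const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % align == 0;
  };
  return aligned(out) && aligned(lhs) && aligned(rhs);
}
```

With a check like this, a slice taken at an odd element offset (as in Failure 1) simply falls back to the scalar kernel instead of crashing with a misaligned-address error.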
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
