connorgoggins opened a new pull request #17634: [Large Tensor] Fixed SoftmaxActivation op
URL: https://github.com/apache/incubator-mxnet/pull/17634
 
 
   ## Description ##
   The Softmax Activation op was previously breaking on large tensor data (total element count >= 2^32). With the following input:
   ```
   from mxnet import nd
   from benchmark.opperf.utils.benchmark_utils import run_performance_test  # opperf helper (import path assumed)

   run_performance_test(nd.SoftmaxActivation, run_backward=True,
                        inputs=[{'data': (2**29, 2, 2, 2), 'out': nd.random_normal(shape=(2**29, 2, 2, 2))}],
                        warmup=1, runs=1)
   ```
   the following error was thrown:
   ```
   TBlob.get_with_shape: Check failed: this->shape_.Size() == 
static_cast<size_t>(shape.Size()) (4294967296 vs. 0) : new and old shape do not 
match total elements
   ```
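
   The numbers in that message line up with 32-bit truncation: the input above holds
   2**29 * 2 * 2 * 2 = 2**32 elements, and a size accumulated in a 32-bit `int` wraps that
   count to 0, which is exactly the "4294967296 vs. 0" mismatch reported by `get_with_shape`.
   A quick arithmetic illustration (not part of the PR itself):
   ```
   total_elements = (2**29) * 2 * 2 * 2
   print(total_elements)               # 4294967296, i.e. 2**32
   print(total_elements & 0xFFFFFFFF)  # 0 -- the value a 32-bit size field would hold
   ```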
   
   To root-cause this issue, I ran the previous command in a Python script under GDB and
   found that the underlying problem was in the shape construction logic of
   `softmax_activation-inl.h`. In the functions that compute the forward pass result and the
   gradient, several variables were declared as `int` when they should have been `index_t`
   to properly handle 64-bit dimensions and sizes. I switched these variables to `index_t`
   and, after rebuilding, the reproduction command above produced the correct output:
   ```
   INFO:root:Begin Benchmark - SoftmaxActivation
   INFO:root:Complete Benchmark - SoftmaxActivation
   [{'SoftmaxActivation': [{'inputs': {'data': (536870912, 2, 2, 2), 'out': 
'<NDArray 536870912x2x2x2 @cpu(0)>'}, 'max_storage_mem_alloc_cpu/0': 
24696062.0, 'avg_time_forward_SoftmaxActivation': 7426.1191, 
'avg_time_backward_SoftmaxActivation': 16664.0254}]}]
   ```
   
   To ensure completeness and to guard against future regressions, I also added a nightly
   test for the Softmax Activation op with large tensor data in
   `tests/nightly/test_large_array.py`.
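
   For reference, a minimal sketch of what such a large-tensor test can look like (the
   constant name, input shape, and assertions below are illustrative assumptions; the
   authoritative version is the one added to `tests/nightly/test_large_array.py` in this PR):
   ```
   import mxnet as mx
   from mxnet import nd

   LARGE_X = 2**29  # hypothetical constant: first dim chosen so the element count reaches 2**32

   def test_softmax_activation_large_tensor():
       data = nd.ones((LARGE_X, 2, 2, 2))
       data.attach_grad()
       with mx.autograd.record():
           out = nd.SoftmaxActivation(data)
       out.backward()
       # With a constant input, the default (instance-mode) softmax is uniform over
       # the 2*2*2 = 8 non-batch elements, so every output value should be 1/8.
       assert out.shape == data.shape
       assert abs(out[0][0][0][0].asscalar() - 1.0 / 8) < 1e-6
       assert data.grad.shape == data.shape
   ```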
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage
   - [x] Code is well-documented
   - [x] To the best of my knowledge, examples are either not affected by this 
change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - M src/operator/nn/softmax_activation-inl.h
   - M tests/nightly/test_large_array.py
   
   ## Comments ##
   Tested on r5dn.24xl (Ubuntu 16.04) and p2.16xl (Ubuntu 16.04) with:
   1. Individual op run
   2. Full OpPerf run
   
   ## Results ##
   The key difference between CPU and GPU tests was the instance type 
(r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remain the same, 
and both were tested using CPU context.
   
   [Single operator test - SoftmaxActivation op 
(GPU)](https://gist.github.com/connorgoggins/5062d3043d04c68e23a80b23f5c0edb1)
   [Single operator test - SoftmaxActivation op 
(CPU)](https://gist.github.com/connorgoggins/09ebe15c3f3aefb34b2a854847d22030)
   
   [Full OpPerf test (GPU)]() - pending
   [Full OpPerf test (CPU)]() - pending
   
   @apeforest @access2rohit @ChaiBapchya 
