connorgoggins opened a new pull request #17617: [LT] Fixed Spatial Transformer op
URL: https://github.com/apache/incubator-mxnet/pull/17617

## Description ##
The Spatial Transformer op previously broke on large tensor data (inputs with a total element count >= 2^32). With the following input:
```
run_performance_test(nd.SpatialTransformer, run_backward=True,
                     inputs=[{'data': (2, 2**29, 1, 6),
                              'loc': nd.random_normal(shape=(2, 6)),
                              'transform_type': 'affine',
                              'sampler_type': 'bilinear',
                              'target_shape': (2, 6)}],
                     warmup=1, runs=1)
```
the following error was thrown:
```
*** Error in `python3': double free or corruption (out): 0x00007f36bbffe010 ***
```
To root-cause this issue, I ran the command above from a Python script under GDB and found that the underlying problem was in the iteration logic of the forward and backward methods in `spatial_transformer.cc`. Several of the loop variables were declared as `int` when they should have been `index_t` to correctly handle long int indices. I switched these variables to `index_t` in the forward and backward methods (see the sketch at the end of this description), and after rebuilding, the same command produced the correct output:
```
INFO:root:Begin Benchmark - SpatialTransformer
INFO:root:Complete Benchmark - SpatialTransformer
[{'SpatialTransformer': [{'inputs': {'data': (2, 536870912, 1, 6), 'loc': '<NDArray 2x6 @cpu(0)>', 'transform_type': 'affine', 'sampler_type': 'bilinear', 'target_shape': (2, 6)}, 'max_storage_mem_alloc_cpu/0': 102005472.0, 'avg_time_forward_SpatialTransformer': 551614.125, 'avg_time_backward_SpatialTransformer': 511034.0938}]}]
```

## Checklist ##
### Essentials ###
Please feel free to remove inapplicable items for your PR.
- [x] Changes are complete (i.e. I finished coding on this PR)
- [x] All changes have test coverage
- [x] Code is well-documented
- [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

### Changes ###
- M src/operator/tensor/spatial_transformer.cc

## Comments ##
Tested on r5dn.24xl (Ubuntu 16.04) and p2.16xl (Ubuntu 16.04) with:
1. Individual op run
2. Full OpPerf run

## Results ##
The key difference between the CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remain the same, and both were tested using CPU context.

[Single operator test - Embedding op (GPU)](https://gist.github.com/connorgoggins/7b3765e4e6f54f7841fb14e0930221ca)
[Single operator test - Embedding op (CPU)](https://gist.github.com/connorgoggins/d97d64e73702a342bdeb170d30aef04f)

[Full OpPerf test (GPU)](https://gist.github.com/connorgoggins/8b0563eaf98119980a5d36b7c61d796e)
[Full OpPerf test (CPU)](https://gist.github.com/connorgoggins/4bebae70a85f7d9f754124187f52f383)

@apeforest @access2rohit @ChaiBapchya
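
## Sketch of the `int` vs `index_t` issue ##
For readers unfamiliar with this class of bug, here is a minimal, hedged sketch of the pattern the fix targets. It is not the actual `spatial_transformer.cc` code; the function `scale_in_place` and its variable names are hypothetical, and it assumes `index_t` is a 64-bit integer (as it is in MXNet builds with large tensor support enabled). The point is simply that every variable participating in offset arithmetic over a tensor with more than 2^31-1 elements must be 64-bit; a 32-bit `int` counter overflows, producing negative offsets and the kind of memory corruption reported above.
```
#include <cstdint>
#include <vector>

// Assumption: mirrors MXNet's 64-bit index_t for illustration only.
using index_t = int64_t;

// Before (buggy pattern): 32-bit counters overflow once the flattened
// offset exceeds 2^31-1, e.g.
//   for (int n = 0; n < batch; ++n)
//     for (int c = 0; c < channels; ++c)
//       for (int i = 0; i < spatial; ++i)
//         data[(n * channels + c) * spatial + i] ...   // 32-bit arithmetic

// After: all loop counters and the offset are index_t, so the
// multiplication and addition happen in 64 bits.
void scale_in_place(std::vector<float>* data,
                    index_t batch, index_t channels, index_t spatial) {
  for (index_t n = 0; n < batch; ++n) {
    for (index_t c = 0; c < channels; ++c) {
      for (index_t i = 0; i < spatial; ++i) {
        const index_t offset = (n * channels + c) * spatial + i;  // 64-bit math
        (*data)[offset] *= 2.0f;
      }
    }
  }
}
```
The actual change in this PR follows the same idea: the iteration variables in the forward and backward methods of the Spatial Transformer op were widened from `int` to `index_t`.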
