ptrendx opened a new pull request #13890: Improve bulking in Gluon
URL: https://github.com/apache/incubator-mxnet/pull/13890
 
 
   ## Description ##
   This PR improves performance of hybridized Gluon models on the GPU by better 
bulking of ops (running GPU ops without synchronization).
   
   ## Checklist ##
   ### Essentials ###
   Please feel free to remove inapplicable items for your PR.
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [x] All changes have test coverage
   - [x] Code is well-documented: 
   - For new C++ functions in header files, their functionalities and arguments 
are documented. 
   - [x] To the my best knowledge, examples are either not affected by this 
change, or have been fixed to be compatible with this change
   
   ### Changes ###
   - [x] For models hybridized with `hybridize()` engine currently creates 
`ImperativeBulk` op that combines multiple operations into 1. However, this 
`ImperativeBulk` op captures entire ops, including synchronization at the end. 
Since the engine dependencies are only updated at the very end of the 
`ImperativeBulk` op, having this synchronization is wasteful. This PR changes 
the RunContext structure so that `ImperativeBulk` op is able to inform the 
bulked ops that it will handle the synchronization itself. Imperative ops check 
that argument to determine whether they need to perform their own 
synchronization.
   - [x] For models hybridized with `hybridize(static_alloc=True, 
static_Shape=True)` there is currently another mechanism of bulking, similar to 
the one used in symbolic API. However, it very aggressively  excludes ops from 
being eligible for bulking - any op that produces output from forward pass 
(which means most of the ops) cannot be bulked. This is artificial limitation 
and this PR lifts it.
   
   ## Comments ##
   - With this PR `hybridize(static_alloc=True)` and 
`hybridize(static_alloc=True, static_shape=True)` become pretty similar in 
performance. The main 2 differences seem to be:
      - a difference in the time it takes to start the forward pass (which I 
did not yet investigate, but non-static shape seems to take more time to start)
      - the fact that `ImperativeBulk` is a temporary `ThreadedEngineOpr` 
destroyed at the end of its invocation, which seems to be a relatively 
expensive operation (hybridized model with everything static mostly does not 
use `ImperativeBulk` so does not pay the cost of destruction of the object).
   - I will post some performance comparisons in a future comment. 
   
   @eric-haibin-lin @piiswrong 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to