marcoabreu commented on issue #13598: More fine-grained operator implementation 
dispatch & memory planning flow 
URL: 
https://github.com/apache/incubator-mxnet/issues/13598#issuecomment-446353433
 
 
   Thanks for your very good questions! 
   
   For the operator selection I would think about a design which has something 
similar to a "tuning" or warm-up stage that evaluates the different 
possibilities. Initially, since that revamp would be quite big and 
experimental, I would hardcode a priority order (e.g. CUDA->AMD->MKLDNN->CPU) 
which is then evaluated, with backends dropped if they don't support that 
operator or are simply not present. Later on, there would ideally be a 
benchmark step which evaluates the different possibilities and then chooses 
the most efficient representation of the graph. This evaluation would first 
run simple benchmarks (with different strategies like memory footprint, power 
consumption, throughput, etc.) of each operator backend, and then in the next 
stage go one level higher and evaluate groups of operators (up to evaluating 
the entire graph) to account for layout-conversion and memcopy overhead. After 
the last iteration, we would have the most efficient graph that is also 
runnable on that hardware.
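   The initial hardcoded-priority stage could be sketched roughly as follows. 
This is only an illustration of the idea, not MXNet code; the backend names, 
`available` set, and `supported` table are assumptions for the example:

```python
# Hypothetical sketch of the hardcoded-priority dispatch stage described
# above. Backends are tried in a fixed order; a backend is dropped if it is
# not present or does not implement the operator.

PRIORITY = ["CUDA", "AMD", "MKLDNN", "CPU"]

def select_backend(op_name, available, supported):
    """Pick the first backend in priority order that is both present
    and implements the requested operator."""
    for backend in PRIORITY:
        if backend in available and op_name in supported.get(backend, ()):
            return backend
    raise RuntimeError(f"no eligible backend for operator {op_name!r}")

# Example: CUDA is absent, so the Conv operator falls through to MKLDNN.
available = {"MKLDNN", "CPU"}
supported = {"CUDA": {"Conv"}, "MKLDNN": {"Conv"}, "CPU": {"Conv"}}
print(select_backend("Conv", available, supported))  # -> MKLDNN
```

   The later benchmark stage would replace the fixed `PRIORITY` list with a 
measured ranking per operator (or per operator group).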
   
   There are two ways I could think of backends conflicting:
   1. Mismatching memory layouts
   2. Impossible/unlikely combinations (CUDA & AMD HIP, or MKL & ARM)
   
   To solve number one, I would extend the design to abstract not only the 
operators but also their memory layouts. In the same way as we would have an 
operator registry, we would have a memory layout registry where each backend 
announces its memory layouts as well as converters. Each operator 
implementation would specify a desired layout (most likely the one it 
registered itself). Now imagine you have a graph with three operators:
   ```
   Input -> Operator1_CUDA -> Operator2_MKL -> Operator3_MKL -> Output
   ```
   These three operators are from two entirely different backends and have 
their own implementation and memory layouts. Our engine would, during the 
initial analysis of the graph (this step is after the optional graph 
optimization and we assume the graph as final at that point), analyse the 
desired layout of each operator (in this case CUDA and MKL, but it could also 
go a level deeper like CUDA_NHWC etc) and then see whether they are compatible. 
If they are not, the engine would request a converter from the memory layout 
registry. These converters would then be inserted into the graph and the final 
graph would look as follows:
   ```
   Input -> Convert_Standard_CUDA -> Operator1_CUDA -> Convert_CUDA_MKL -> 
Operator2_MKL -> Operator3_MKL -> Convert_MKL_Standard -> Output
   ```
   This way, you will always have compatibility between the different layouts, 
while neither the operators nor the engine have to care about the different 
backends, as the conversion happens in between. When an operator receives and 
outputs data, it expects to be in its "isolated" world. If adjacent operators 
are from the same backend and use the same layout, though, this conversion is 
skipped and a performance advantage is achieved.
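   The converter-insertion pass described above could look roughly like this. 
It is a sketch under the assumption that the graph is a simple chain and that 
a `layout_of` mapping from the registry is available; the names are 
hypothetical:

```python
# Illustrative converter-insertion pass: walk the finalized graph and splice
# a conversion node wherever adjacent operators disagree on memory layout.
# Graph inputs/outputs use the standard layout at the boundaries.

def insert_converters(ops, layout_of, boundary="STANDARD"):
    """ops: ordered operator names; layout_of: op -> desired memory layout.
    Returns the graph with Convert_<src>_<dst> nodes spliced in."""
    result = []
    prev = boundary
    for op in ops:
        cur = layout_of[op]
        if cur != prev:  # layout mismatch -> insert a converter node
            result.append(f"Convert_{prev}_{cur}")
        result.append(op)
        prev = cur
    if prev != boundary:  # convert back to the standard layout at the output
        result.append(f"Convert_{prev}_{boundary}")
    return result

graph = ["Operator1_CUDA", "Operator2_MKL", "Operator3_MKL"]
layouts = {"Operator1_CUDA": "CUDA",
           "Operator2_MKL": "MKL",
           "Operator3_MKL": "MKL"}
print(" -> ".join(["Input"] + insert_converters(graph, layouts) + ["Output"]))
```

   Note that no converter is inserted between the two MKL operators, which is 
exactly where the performance advantage comes from.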
   Now at this point you could end up with O(N^2) converters if you needed a 
direct converter between every single pair of memory layouts. The trick here 
is to have a standard layout (which we basically already have, and which is 
used to input and output data from the graphs). Each memory layout has to 
register at least two converters: TO_STANDARD and FROM_STANDARD. This keeps 
backends compatible even where no direct conversion exists. Since this 
requires two conversions (FROM_MEMLAYOUT1_TO_STANDARD and 
FROM_STANDARD_TO_MEMLAYOUT2), it adds overhead but keeps compatibility high. 
For common cases, there would probably be direct converters. 
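   A minimal sketch of such a registry with the standard-layout fallback, 
again with hypothetical names:

```python
# Layout-converter registry with a STANDARD fallback: if no direct converter
# between two layouts is registered, route through the standard layout in
# two hops (TO_STANDARD followed by FROM_STANDARD).

class LayoutRegistry:
    def __init__(self):
        self.converters = {}  # (src, dst) -> converter node name

    def register(self, src, dst):
        self.converters[(src, dst)] = f"Convert_{src}_{dst}"

    def path(self, src, dst):
        """Return the chain of converter nodes needed from src to dst."""
        if src == dst:
            return []  # same layout: no conversion needed
        direct = self.converters.get((src, dst))
        if direct:
            return [direct]  # a direct converter was registered
        # Fall back to two conversions through the standard layout.
        return [self.converters[(src, "STANDARD")],
                self.converters[("STANDARD", dst)]]

reg = LayoutRegistry()
for layout in ("CUDA", "MKL"):  # each backend registers TO/FROM standard
    reg.register(layout, "STANDARD")
    reg.register("STANDARD", layout)

print(reg.path("CUDA", "MKL"))  # no direct converter -> two-hop path
```

   Registering a direct `CUDA`->`MKL` converter later would shorten that path 
to a single hop without changing any caller.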
   
   For the second case, where conflicting backends exist, they would simply be 
skipped during the evaluation stage when the engine checks whether an operator 
is actually eligible. So if CUDA is not present, a CUDA operator will simply 
not be considered for that graph.
   
   
   
