samskalicky commented on a change in pull request #17270: [WIP] Dynamic custom
operator GPU support
URL: https://github.com/apache/incubator-mxnet/pull/17270#discussion_r366123444
##########
File path: src/c_api/c_api.cc
##########
@@ -720,8 +751,11 @@ int MXLoadLib(const char *path) {
gradOp.set_attr<bool>("TIsLayerOpBackward", true, plevel);
gradOp.set_attr<FStatefulComputeEx>("FStatefulComputeEx<cpu>",
fstateful_backward, plevel);
+ gradOp.set_attr<FStatefulComputeEx>("FStatefulComputeEx<gpu>",
Review comment:
   > What are the target supported contexts for this feature? Do we target just
cpu and gpu, or do we want to support other hardware backends, too?
   Yes, let's just target the currently supported contexts. We can swing back
around later when we add support for dynamic loading of contexts. We'll make sure
to implement something generic in this PR (i.e. setting the context with a string
rather than an enum) so it will just work.
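   To make the "string rather than enum" point concrete, here is a minimal sketch assuming a hypothetical `CustomOpReg`/`setForward`/`dispatch` API (none of these are real MXNet or lib_api.h symbols): implementations are keyed by a context name string, so a later dynamically loaded context would only need a new key, not a new enum value.

```cpp
// Minimal sketch of string-keyed context registration; all names here are
// illustrative, not MXNet's actual custom-operator API.
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical forward-function signature; real signatures carry tensors/ctx.
using ForwardFn = std::function<void()>;

struct CustomOpReg {
  std::unordered_map<std::string, ForwardFn> forward_by_ctx;

  // Register a forward function for a context named by string ("cpu", "gpu", ...).
  CustomOpReg& setForward(const std::string& ctx, ForwardFn fn) {
    forward_by_ctx[ctx] = std::move(fn);
    return *this;
  }

  // Dispatch to whatever implementation was registered for the requested context.
  void dispatch(const std::string& ctx) const {
    auto it = forward_by_ctx.find(ctx);
    if (it != forward_by_ctx.end()) it->second();
    else std::cerr << "no forward registered for context: " << ctx << "\n";
  }
};

int main() {
  CustomOpReg gemm;
  gemm.setForward("cpu", [] { std::cout << "gemm on cpu\n"; })
      .setForward("gpu", [] { std::cout << "gemm on gpu\n"; });
  gemm.dispatch("gpu");  // prints "gemm on gpu"
  return 0;
}
```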
> Currently the dispatch logic is inside FCompute, which is a bit different
from existing mxnet users' experience. Usually the FCompute only declares the
computation, and leave the dispatch logic to MXNet executor. And it's unclear
how it supports the case where the same op is extended by a library for Intel
CPUs and NVIDIA GPUs - they may hard-code the dispatch logic to only care about
their own hardware. How do we handle such conflicts?
   Good catch. We'll change it so that users can specify the context and the
Forward/Backward functions in the registration. But for custom operators we can
only support what the current implementation in MXNet allows: the top-level
scope is an operator, and it has implementations for different contexts.
   What you're describing is the context being the top-level scope, with a
distinct operator registration inside it. As you describe next, that
organization is not supported in MXNet. We should discuss it as a separate
feature enhancement from this PR (the scope of this PR is to add GPU support to
custom operators -- and only that).
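   For reference, the structure the diff above targets can be sketched like this (with `OpStub` standing in for nnvm::Op and a bare function pointer standing in for FStatefulComputeEx; an illustration, not the real code in c_api.cc): one operator entry at the top level, with per-context implementations stored under keys like `FStatefulComputeEx<cpu>` and `FStatefulComputeEx<gpu>` that the executor can later select by context.

```cpp
// Sketch only: OpStub stands in for nnvm::Op, and the callback type stands in
// for FStatefulComputeEx. The real registration lives in src/c_api/c_api.cc.
#include <map>
#include <string>

typedef void (*fstateful_t)();  // simplified stateful-compute callback

struct OpStub {
  std::map<std::string, fstateful_t> attrs;
  OpStub& set_attr(const std::string& key, fstateful_t fn) {
    attrs[key] = fn;  // the executor later picks the key matching the context
    return *this;
  }
};

void fstateful_backward_cpu() { /* CPU backward body elided */ }
void fstateful_backward_gpu() { /* GPU backward body elided */ }

int main() {
  OpStub gradOp;  // the operator is the top-level scope
  gradOp.set_attr("FStatefulComputeEx<cpu>", fstateful_backward_cpu)
        .set_attr("FStatefulComputeEx<gpu>", fstateful_backward_gpu);
  return 0;
}
```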
   > Furthermore, currently the infer_shape/infer_dtype is not context-aware,
i.e. CPU and GPU infer the same dtype. However, that may not be true (e.g. cpu
supports fp32 and bfloat16, and gpu supports fp32 and fp16). How do we handle
these attribute conflicts?
>
> I had a short discussion with @yzhliu and we saw two potential fixes:
>
   >    1. make infer_shape/infer_dtype context aware. This way we can have
different infer_dtype functions for cpu & gpu. MXNet needs to dispatch to the
function based on the current context. For example,
`op.set_attr<FInferType>("FInferType<cpu>", my_infer_type_function)` for cpu
specific type inference, and `op.set_attr<FInferType>("FInferType<gpu>",
my_infer_type_function_gpu)` for gpu.
   >    2. Another way is to register ops with different names (e.g. 'cpu_gemm'
and 'gpu_gemm'). This way they can have different infer_attr functions. But we
don't want users to modify their model definition in the training script to use
these names. To mitigate that we can have an API that allows the user to provide
a mapping (e.g. {'gemm' -> 'cpu_gemm'}) for mxnet to map an op to another op
registered in the backend.
   This is a good idea; we should have a separate PR to add this new feature to
the MXNet backend and then extend it out to custom operators. But it is out of
scope for this PR.
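   To illustrate option 1 above, here is a hedged sketch of context-aware dtype inference, assuming a hypothetical context-keyed registry (MXNet's FInferType is not registered per context today); the dtype codes and function names are made up for the example, with the cpu variant accepting fp32/bfloat16 and the gpu variant accepting fp32/fp16.

```cpp
// Sketch of per-context infer_dtype functions; dtype codes are illustrative,
// not MXNet's real TypeFlag values.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

enum DType { kFloat32, kFloat16, kBFloat16 };

using InferTypeFn = bool (*)(const std::vector<int>& in_types);

// CPU variant: accepts fp32 and bfloat16.
bool infer_type_cpu(const std::vector<int>& in_types) {
  for (int t : in_types)
    if (t != kFloat32 && t != kBFloat16) return false;
  return true;
}

// GPU variant: accepts fp32 and fp16.
bool infer_type_gpu(const std::vector<int>& in_types) {
  for (int t : in_types)
    if (t != kFloat32 && t != kFloat16) return false;
  return true;
}

int main() {
  // Hypothetical context-keyed registry, mirroring the proposed
  // op.set_attr<FInferType>("FInferType<cpu>", ...) calls quoted above.
  std::unordered_map<std::string, InferTypeFn> infer_type_by_ctx = {
    {"FInferType<cpu>", infer_type_cpu},
    {"FInferType<gpu>", infer_type_gpu},
  };

  std::vector<int> types = {kFloat16};
  std::cout << "cpu accepts fp16: " << infer_type_by_ctx["FInferType<cpu>"](types) << "\n"   // 0
            << "gpu accepts fp16: " << infer_type_by_ctx["FInferType<gpu>"](types) << "\n";  // 1
  return 0;
}
```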
   Another point: the current custom operator support does not allow registering
the CPU implementation of an operator in one library and the GPU implementation
of the same operator in another. Merging functionality from different libraries
like that is a good idea for a future feature.
> Finally, is there a plan to support dynamic custom context? :P @samskalicky
I'll add it to the list behind: custom data loaders, graph passes, etc... :D
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services