cfRod commented on issue #20265: URL: https://github.com/apache/incubator-mxnet/issues/20265#issuecomment-874751442
Hi @Zha0q1 ,

This is my WIP analysis so far (see the verbose logs below to support it).

The convolution with grouping calls into the gemm reference kernel, which is expected since grouped convolutions are not supported by Compute Library at the moment. In the case of manually slicing the input and running the convolution on the two slices of input and weights, it calls gemm:acl, and the first half of the output matches the output from the reference kernel, but the second half does not.

My initial analysis is that the primitive is created once and reused twice with different inputs and weights (see logs below). However, Compute Library assumes the weights are immutable: it preprocesses the weights internally once per primitive creation, so the second time around the new set of weights is not picked up by ACL.

```
/home/ubuntu/mxnet/mxnet/src/executor/graph_executor.cc:1992: Subgraph backend MKLDNN is activated.
[14:52:42] /home/ubuntu/mxnet/mxnet/src/executor/graph_executor.cc:1992: Subgraph backend MKLDNN is activated.
dnnl_verbose,info,oneDNN v2.0.0 (commit 83ebc40d86bc54f0f23e947235e53570eeacf254)
dnnl_verbose,info,cpu,runtime:OpenMP
dnnl_verbose,info,cpu,isa:Generic
dnnl_verbose,info,gpu,runtime:none
# THIS IS GEMM REFERENCE
dnnl_verbose,create:cache_miss,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_g2ic4oc4_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.130859
dnnl_verbose,exec,cpu,convolution,gemm:ref,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcde:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_g2ic4oc4_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,19.759
# THIS IS GEMM ACL with sliced inputs
dnnl_verbose,create:cache_miss,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.0969238
dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.415039
dnnl_verbose,exec,cpu,convolution,gemm:acl,forward_training,src_f32::blocked:abcd:f0 wei_f32::blocked:abcd:f0 bia_f32::blocked:a:f0 dst_f32::blocked:abcd:f0,,alg:convolution_direct,mb1_ic2oc2_ih9oh7kh3sh1dh0ph0_iw9ow7kw3sw1dw0pw0,0.321045
# THIS IS THE CONCAT STAGE
dnnl_verbose,create:cache_miss,cpu,concat,simple:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,axis:1,1x2x7x7:1x2x7x7 1x4x7x7,0.0700684
dnnl_verbose,exec,cpu,concat,simple:any,undef,src_f32::blocked:abcd:f0 src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,,axis:1,1x2x7x7:1x2x7x7 1x4x7x7,0.468994
```

We've seen a similar issue with primitive caching in TensorFlow, and I am looking to confirm the hypothesis, possibly by modifying the test.
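For reference, here is a minimal sketch of the kind of repro I have in mind. This is an assumption about the test's shape (not the actual test), reconstructed from the geometries in the logs above (mb1, g2, ic4, oc4, 9x9 input, 3x3 kernel); it compares the grouped convolution against two manually sliced convolutions:

```python
# Hypothetical repro sketch (not the original test): grouped convolution vs.
# manually sliced convolutions, using the geometry from the logs above.
# Run with oneDNN verbose logging enabled (e.g. DNNL_VERBOSE=1) to see dispatch.
import mxnet as mx
import numpy as np

data = mx.nd.random.uniform(shape=(1, 4, 9, 9))
weight = mx.nd.random.uniform(shape=(4, 2, 3, 3))  # 2 input channels per group
bias = mx.nd.random.uniform(shape=(4,))

# Grouped convolution: expected to dispatch to gemm:ref
grouped = mx.nd.Convolution(data=data, weight=weight, bias=bias,
                            kernel=(3, 3), num_filter=4, num_group=2)

# Manually sliced version: two plain convolutions, expected to dispatch to gemm:acl
outs = []
for g in range(2):
    d = mx.nd.slice_axis(data, axis=1, begin=2 * g, end=2 * (g + 1))
    w = mx.nd.slice_axis(weight, axis=0, begin=2 * g, end=2 * (g + 1))
    b = mx.nd.slice_axis(bias, axis=0, begin=2 * g, end=2 * (g + 1))
    outs.append(mx.nd.Convolution(data=d, weight=w, bias=b,
                                  kernel=(3, 3), num_filter=2))
sliced = mx.nd.concat(*outs, dim=1)

# Per the analysis above, the first group should match and the second should not,
# if the reused gemm:acl primitive keeps the first group's preprocessed weights.
np.testing.assert_allclose(sliced.asnumpy(), grouped.asnumpy(), rtol=1e-5, atol=1e-5)
```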