maybeLee opened a new issue #20805:
URL: https://github.com/apache/incubator-mxnet/issues/20805


   ## Description
   I suspect this is caused by incorrect size inference of `L[!Ta].Size()` or `R[Tb].Size()` in `src/operator/tensor/./dot-inl.h:1241`, but I am not entirely sure.
   I was constructing a model using Keras-MXNet as the high-level API; the model summary is as follows:
   ```
   _________________________________________________________________
   Layer (type)                 Output Shape              Param #   
   =================================================================
   input_12 (InputLayer)        (None, 28, 28, 3)         0         
   _________________________________________________________________
   average_pooling2d_8 (Average (None, 28, 28, 3)         0         
   _________________________________________________________________
   flatten_11 (Flatten)         (None, 2352)              0         
   _________________________________________________________________
   dense_11 (Dense)             (None, 100)               235300    
   =================================================================
   Total params: 235,300
   Trainable params: 235,300
   Non-trainable params: 0
   _________________________________________________________________
   ```
   But I get the following error:
   ```
   MXNetError: MXNetError: Error in operator dot16: [06:11:14] ../src/operator/tensor/./dot-inl.h:1241: Check failed: L[!Ta].Size() == R[Tb].Size() (2523 vs. 2352) : dot shape error: [10,2523] X [2352,100]
   
   During handling of the above exception, another exception occurred:
   
   RuntimeError                              Traceback (most recent call last)
   /usr/local/lib/python3.7/dist-packages/mxnet/symbol/symbol.py in simple_bind(self, ctx, grad_req, type_dict, stype_dict, group2ctx, shared_arg_names, shared_exec, shared_buffer, **kwargs)
      1942                 error_msg += "%s: %s\n" % (k, v)
      1943             error_msg += "%s" % e
   -> 1944             raise RuntimeError(error_msg)
      1945 
      1946         # update shared_buffer
   
   RuntimeError: simple_bind error. Arguments:
   /input_121: (10, 28, 28, 3)
   MXNetError: Error in operator dot16: [06:11:14] ../src/operator/tensor/./dot-inl.h:1241: Check failed: L[!Ta].Size() == R[Tb].Size() (2523 vs. 2352) : dot shape error: [10,2523] X [2352,100]
   ```
   After checking the source code in `./dot-inl.h`, I noticed that `L[!Ta].Size()` and `R[Tb].Size()` are inconsistent, which causes this error. However, the output shape of the Flatten layer should be 2352, not 2523. That is why I suspect the inference of `L[!Ta].Size()` may be wrong.
   
   ## To Reproduce
   ```
   import os
   os.environ["KERAS_BACKEND"] = "mxnet"
   import keras
   # 28x28x3 input; 2x2 average pooling with stride 1 and "same" padding triggers the crash
   width = 28
   x = keras.layers.Input((width, width, 3))
   layer_stack = [
       keras.layers.AveragePooling2D(pool_size=[2, 2], strides=1, padding="same"),
       # keras.layers.AveragePooling1D(pool_size=1, strides=1, padding="same", data_format="channels_last"),
       keras.layers.Flatten(),
       keras.layers.Dense(100),
   ]
   temp = x
   for layer in layer_stack:
       y = layer(temp)
       temp = y
   model = keras.Model(x, y)
   model.summary()

   import numpy as np
   data = np.random.rand(10, width, width, 3)  # batch of 10 random images
   res = model.predict(data)
   ```
   
   ### Steps to reproduce
   Just run the above code with Keras-MXNet installed, or open the Colab notebook:
   
https://colab.research.google.com/drive/1i-gOcbdah1-S_Hdk5_xme0u0CsRmL2Le?usp=sharing
   
   ## What have you tried to solve it?
   I haven't figured out how to fix this issue yet, but I can share my findings:
   1. This bug exists in the latest MXNet release (1.9.1).
   2. This bug is triggered when `strides=1` and `pool_size` is even in the `AveragePooling2D` layer. The `Flatten` and `Dense` layers are also necessary to trigger the crash.
   3. I checked the `pad` that Keras-MXNet sends to MXNet's `mx.sym.Pooling`: it is `(1,1)`, which is consistent with `padding="same"`, so I suspect Keras-MXNet is not to blame (see the sketch after this list).
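   Note that 2523 = 29 x 29 x 3, which corresponds to a 29x29 pooled feature map rather than the 28x28 map reported in the Keras summary. Below is a minimal sketch (my assumption about how the layer is translated, not a confirmed trace of Keras-MXNet internals) that checks what shape plain MXNet infers for average pooling with `kernel=(2,2)`, `stride=(1,1)`, `pad=(1,1)` followed by `Flatten`:
   ```
   import mxnet as mx

   data = mx.sym.Variable("data")
   # Assumed translation of AveragePooling2D(pool_size=(2,2), strides=1, padding="same"):
   # a 2x2 window with stride 1 and symmetric padding of 1.
   pool = mx.sym.Pooling(data, kernel=(2, 2), stride=(1, 1), pad=(1, 1), pool_type="avg")
   flat = mx.sym.Flatten(pool)

   # NCHW layout: batch 10, 3 channels, 28x28 spatial size.
   _, out_shapes, _ = flat.infer_shape(data=(10, 3, 28, 28))
   print(out_shapes)  # if the pooled map comes out as 29x29, this prints [(10, 2523)]
   ```
   If this prints `[(10, 2523)]`, the 29x29 shape is already present before the dot, which may help narrow down where the inconsistency is introduced.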
   
   In conclusion, I suspect the problem is caused by incorrect size inference in the `DotShape` function in `src/operator/tensor/./dot-inl.h`, specifically in these four branches:
   ```
   // src/operator/tensor/./dot-inl.h:1221
   if (Ta) {
     L[0] = mshadow::Shape1(lshape[0]);
     L[1] = lshape.ndim() > 1 ?
            mxnet::TShape(&lshape[1], lshape.end()) : mxnet::TShape(1, 1);
   } else {
     L[0] = lshape.ndim() > 1 ?
            mxnet::TShape(&lshape[0], &lshape[lshape.ndim()-1]) : mxnet::TShape(1, 1);
     L[1] = mshadow::Shape1(lshape[lshape.ndim()-1]);
   }
   if (Tb) {
     R[0] = rshape.ndim() > 1 ?
            mxnet::TShape(&rshape[0], &rshape[rshape.ndim()-1]) : mxnet::TShape(1, 1);
     R[1] = mshadow::Shape1(rshape[rshape.ndim()-1]);
   } else {
     R[0] = mshadow::Shape1(rshape[0]);
     R[1] = rshape.ndim() > 1 ?
            mxnet::TShape(&rshape[1], rshape.end()) : mxnet::TShape(1, 1);
   }
   ```
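   For context, here is a minimal sketch of what that check compares (my reading of the code, not a confirmed diagnosis): with `Ta = Tb = false`, `L[1]` is the last dimension of the lhs shape and `R[0]` is the first dimension of the rhs shape, and `DotShape` requires their sizes to match. The failing case from the error message can be reproduced directly with symbolic shape inference:
   ```
   import mxnet as mx

   lhs = mx.sym.Variable("lhs")
   rhs = mx.sym.Variable("rhs")
   out = mx.sym.dot(lhs, rhs)

   # The same shapes as in the error message: [10, 2523] X [2352, 100].
   # The contracted axes (2523 vs. 2352) disagree, so DotShape should reject this.
   try:
       out.infer_shape(lhs=(10, 2523), rhs=(2352, 100))
   except mx.base.MXNetError as e:
       print(e)  # expected to show the "dot shape error" check from dot-inl.h
   ```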
   Could you help check whether this is the actual bug and how it can be fixed?
   
   Thanks

