wzh99 commented on issue #11704:
URL: https://github.com/apache/tvm/issues/11704#issuecomment-1157146229

   @ganler Thanks for investigating this bug. I also looked into the 
compilation process of this test case and would like to share my observations 
as well. 
   
   The direct cause of this bug is here:
   
https://github.com/apache/tvm/blob/ec918644ef01df81354bcf958f686e2b8863dac4/python/tvm/topi/x86/dense.py#L69-L70
   The TOPI implementation of `dense_pack.x86` assumes here that `s[O].op.axis` 
has two elements. For this test case, however, `s[O].op.axis` has three 
elements, so unpacking it into `y, x` raises a `ValueError`. 
   
   A quick fix for this bug is to tighten the condition of this if-statement as 
follows:
   ```python
   if C != O and len(s[O].op.axis) == 2:
       y, x = s[O].op.axis
       ...
   ```
   The then-branch only performs additional transformations on the schedule. 
Without them, the schedule is still valid, though possibly slower, and the 
`ValueError` no longer occurs. 
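
To illustrate why the length check avoids the error, here is a minimal sketch in plain Python. Tuples of strings stand in for `s[O].op.axis`, and the names `split_output_axes`, `axes_rank2`, and `axes_rank3` are hypothetical and not part of TVM:

```python
def split_output_axes(axes):
    """Return (y, x) only when there are exactly two axes, otherwise None."""
    if len(axes) == 2:
        y, x = axes
        return y, x
    # Any other rank: skip the extra scheduling step instead of failing.
    return None


axes_rank2 = ("y", "x")       # what the schedule expects
axes_rank3 = ("b", "y", "x")  # what it receives in this test case

# Unguarded unpacking of three axes into two names raises ValueError.
try:
    y, x = axes_rank3
    unguarded_error = None
except ValueError as err:
    unguarded_error = err

print(type(unguarded_error).__name__)
print(split_output_axes(axes_rank2))
print(split_output_axes(axes_rank3))
```

The guarded helper simply skips the extra step for a rank it does not expect, which mirrors the effect of adding `len(s[O].op.axis) == 2` to the condition.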
   
   However, I do not think this fully resolves the problem. Inspecting the TOPI 
implementation of `dense_pack.x86` in a Python debugger, I found that the 
symbolic tensor `O` has rank 3. Why is a rank-3 tensor passed to the TOPI 
implementation of `dense_pack.x86`, which always outputs a rank-2 tensor? 
   
   Here is my analysis. This test case is compiled at optimization level 1. One 
important optimization at this level is operator fusion, and I suspect that it 
causes this rank mismatch. The fused Relay program is printed below: 
   ```
   def @main(%x0 {virtual_device=VirtualDevice(device_type=1, 
virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 
2), float32] */, %x1 {virtual_device=VirtualDevice(device_type=1, 
virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 
2), float32] */, %x2 {virtual_device=VirtualDevice(device_type=1, 
virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 
2), float32] */, %w {virtual_device=VirtualDevice(device_type=1, 
virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0})))}: Tensor[(2, 2), 
float32] /* ty=Tensor[(2, 2), float32] */, hash="a9246759e5fdd017", 
virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, 
target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, 
host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))) -> 
Tensor[(1, 1, 2), float32] {
     %4 = fn (%p0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, 
%p1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p2: Tensor[(1, 
2), float32] /* ty=Tensor[(1, 2), float32] */, %p3: Tensor[(2, 2), float32] /* 
ty=Tensor[(2, 2), float32] */, Primitive=1, hash="b905c6cc3d297a44") -> 
Tensor[(1, 1, 2), float32] {
       %0 = maximum(%p0, %p1) /* ty=Tensor[(1, 2), float32] */;
       %1 = nn.dense(%p2, %p3, units=None) /* ty=Tensor[(1, 2), float32] */;
       %2 = expand_dims(%0, axis=1) /* ty=Tensor[(1, 1, 2), float32] */;
       %3 = multiply(%0, %1) /* ty=Tensor[(1, 2), float32] */;
       add(%2, %3) /* ty=Tensor[(1, 1, 2), float32] */
     } /* ty=fn (Tensor[(1, 2), float32], Tensor[(1, 2), float32], Tensor[(1, 
2), float32], Tensor[(2, 2), float32]) -> Tensor[(1, 1, 2), float32] */;
     %4(%x0, %x1, %x2, %w) /* ty=Tensor[(1, 1, 2), float32] */
   }
   ```
   It seems that the whole graph is fused into a single group whose output is a 
rank-3 tensor. Since a group is a single scheduling unit, I guess that is why 
`dense_pack.x86` receives a rank-3 tensor as `O`. 
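
The ranks in the fused function above can be traced by hand. The following is a small sketch in plain Python that assumes NumPy-style broadcasting for `multiply` and `add`. The helpers `broadcast_shape` and `expand_dims_shape` are hypothetical and only mimic shape inference; they are not TVM APIs:

```python
def broadcast_shape(a, b):
    """Broadcast two static shapes, aligning dimensions from the right."""
    rank = max(len(a), len(b))
    a = (1,) * (rank - len(a)) + tuple(a)
    b = (1,) * (rank - len(b)) + tuple(b)
    out = []
    for da, db in zip(a, b):
        if da != db and 1 not in (da, db):
            raise ValueError(f"cannot broadcast {a} with {b}")
        out.append(db if da == 1 else da)
    return tuple(out)


def expand_dims_shape(shape, axis):
    """Shape after inserting a new axis of extent 1 at position `axis`."""
    return tuple(shape[:axis]) + (1,) + tuple(shape[axis:])


max_out = broadcast_shape((1, 2), (1, 2))      # %0 = maximum(%p0, %p1)
dense_out = (1, 2)                             # %1 = nn.dense(%p2, %p3)
expanded = expand_dims_shape(max_out, 1)       # %2 = expand_dims(%0, axis=1)
product = broadcast_shape(max_out, dense_out)  # %3 = multiply(%0, %1)
group_out = broadcast_shape(expanded, product) # add(%2, %3)

print(len(dense_out), len(group_out))
```

Under these assumptions the `dense` output has rank 2 while the group output has rank 3, which matches the types annotated in the printed program.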
   
   I also tried a simpler program with a `dense` and a broadcasting operator:
   ```
   def @main(%x1: Tensor[(1, 1, 2), float32] /* ty=Tensor[(1, 1, 2), float32] 
*/, %x2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w: 
Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */) -> Tensor[(1, 1, 2), 
float32] {
     %0 = nn.dense(%x2, %w, units=None) /* ty=Tensor[(1, 2), float32] */;
     add(%x1, %0) /* ty=Tensor[(1, 1, 2), float32] */
   }
   ```
   The fused version is:
   ```
   def @main(%x1 {virtual_device=VirtualDevice(device_type=1, 
virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0})))}: Tensor[(1, 1, 2), float32] /* ty=Tensor[(1, 
1, 2), float32] */, %x2 {virtual_device=VirtualDevice(device_type=1, 
virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 
2), float32] */, %w {virtual_device=VirtualDevice(device_type=1, 
virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, 
attrs={'link-params': (bool)0})))}: Tensor[(2, 2), float32] /* ty=Tensor[(2, 
2), float32] */, hash="c9ab1788559517b8", 
virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, 
target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, 
host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': 
(bool)0})))) -> Tensor[(1, 1, 2), float32] {
     %0 = fn (%p01: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, 
%p11: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, Primitive=1, 
hash="229658a1737b78d0") -> Tensor[(1, 2), float32] {
       nn.dense(%p01, %p11, units=None) /* ty=Tensor[(1, 2), float32] */
     } /* ty=fn (Tensor[(1, 2), float32], Tensor[(2, 2), float32]) -> 
Tensor[(1, 2), float32] */;
     %1 = %0(%x2, %w) /* ty=Tensor[(1, 2), float32] */;
     %2 = fn (%p0: Tensor[(1, 1, 2), float32] /* ty=Tensor[(1, 1, 2), float32] 
*/, %p1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, Primitive=1, 
hash="f8991b5d265cd460") -> Tensor[(1, 1, 2), float32] {
       add(%p0, %p1) /* ty=Tensor[(1, 1, 2), float32] */
     } /* ty=fn (Tensor[(1, 1, 2), float32], Tensor[(1, 2), float32]) -> 
Tensor[(1, 1, 2), float32] */;
     %2(%x1, %1) /* ty=Tensor[(1, 1, 2), float32] */
   }
   ```
   In this case, the `dense` and the broadcasting `add` are NOT fused together. 
I do not know why they are fused in the original program. Perhaps there is a 
bug in the fusion algorithm. However, I am not familiar with the details of the 
operator fusion implementation, so I would appreciate some help in fully 
understanding this problem. 
   

