wzh99 commented on issue #11704: URL: https://github.com/apache/tvm/issues/11704#issuecomment-1157146229
@ganler Thanks for your investigation into this bug. I also looked into the compilation process of this test case and would like to share my observations.

The direct cause of this bug is here: https://github.com/apache/tvm/blob/ec918644ef01df81354bcf958f686e2b8863dac4/python/tvm/topi/x86/dense.py#L69-L70

The TOPI implementation of `dense_pack.x86` assumes that `s[O].op.axis` has two elements. For this test case, however, `s[O].op.axis` has three elements, so unpacking it into `y, x` raises a `ValueError`. A quick fix is to modify the condition of this if-statement as follows:

```python
if C != O and len(s[O].op.axis) == 2:
    y, x = s[O].op.axis
    ...
```

The then-branch performs additional transformations on the schedule. Without these transformations, the schedule is still valid (perhaps with performance degradation). At least there will be no `ValueError`.

However, I do not think that the problem is completely resolved here. In the TOPI implementation of `dense_pack.x86`, I found in my Python debugger that the rank of the symbolic tensor `O` is 3. Why is a tensor of rank 3 passed to the TOPI implementation of `dense_pack.x86`, which always outputs a tensor of rank 2?

Here is my analysis. This test case is compiled at optimization level 1. At this level, one important optimization is operator fusion. I think that this rank mismatch is possibly caused by operator fusion.
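As an aside on the direct cause described above, here is a minimal sketch in plain Python of why the unpacking fails and how the guarded condition avoids it. It does not use TVM: the function names `split_output_axes` and `unguarded_split` are hypothetical, and plain strings stand in for the `IterVar` objects that `s[O].op.axis` would actually hold.

```python
# Hypothetical stand-ins: in TVM, the axes would be IterVar objects taken
# from s[O].op.axis. Plain strings are used here purely for illustration.


def unguarded_split(axes):
    """Mimic the original code path: unconditional two-way unpacking."""
    y, x = axes  # raises ValueError unless len(axes) == 2
    return y, x


def split_output_axes(axes, c_is_output):
    """Mimic the quick fix: unpack only when there are exactly two axes.

    Returns (y, x) when the extra schedule transformations would apply,
    and None when the branch is skipped.
    """
    if not c_is_output and len(axes) == 2:
        y, x = axes
        return y, x
    return None


rank2_axes = ["i", "j"]       # what dense_pack.x86 expects
rank3_axes = ["i", "j", "k"]  # what the fused group actually produces

print(split_output_axes(rank2_axes, c_is_output=False))  # ('i', 'j')
print(split_output_axes(rank3_axes, c_is_output=False))  # None

try:
    unguarded_split(rank3_axes)
except ValueError as err:
    print("ValueError:", err)
```

With the guard, the rank-3 case silently skips the branch, which matches the point above: the error goes away, but the question of why a rank-3 output reaches this schedule remains.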
I printed the fused Relay program:

```
def @main(%x0 {virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %x1 {virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %x2 {virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w {virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))}: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, hash="a9246759e5fdd017", virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))) -> Tensor[(1, 1, 2), float32] {
  %4 = fn (%p0: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p3: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, Primitive=1, hash="b905c6cc3d297a44") -> Tensor[(1, 1, 2), float32] {
    %0 = maximum(%p0, %p1) /* ty=Tensor[(1, 2), float32] */;
    %1 = nn.dense(%p2, %p3, units=None) /* ty=Tensor[(1, 2), float32] */;
    %2 = expand_dims(%0, axis=1) /* ty=Tensor[(1, 1, 2), float32] */;
    %3 = multiply(%0, %1) /* ty=Tensor[(1, 2), float32] */;
    add(%2, %3) /* ty=Tensor[(1, 1, 2), float32] */
  } /* ty=fn (Tensor[(1, 2), float32], Tensor[(1, 2), float32], Tensor[(1, 2), float32], Tensor[(2, 2), float32]) -> Tensor[(1, 1, 2), float32] */;
  %4(%x0, %x1, %x2, %w) /* ty=Tensor[(1, 1, 2), float32] */
}
```

It seems that the whole graph is fused into a single group, and the output of this group is a tensor of rank 3. Since a group is a single scheduling unit, I guess that is why `dense_pack.x86` receives a rank-3 tensor as `O`.

I also tried a simpler program with a `dense` and a broadcasting operator:

```
def @main(%x1: Tensor[(1, 1, 2), float32] /* ty=Tensor[(1, 1, 2), float32] */, %x2: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */) -> Tensor[(1, 1, 2), float32] {
  %0 = nn.dense(%x2, %w, units=None) /* ty=Tensor[(1, 2), float32] */;
  add(%x1, %0) /* ty=Tensor[(1, 1, 2), float32] */
}
```

The fused version is:

```
def @main(%x1 {virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))}: Tensor[(1, 1, 2), float32] /* ty=Tensor[(1, 1, 2), float32] */, %x2 {virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))}: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %w {virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))}: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, hash="c9ab1788559517b8", virtual_device=VirtualDevice(device_type=1, virtual_device_id=0, target=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0}, host=Target(kind='llvm', keys={'cpu'}, attrs={'link-params': (bool)0})))) -> Tensor[(1, 1, 2), float32] {
  %0 = fn (%p01: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, %p11: Tensor[(2, 2), float32] /* ty=Tensor[(2, 2), float32] */, Primitive=1, hash="229658a1737b78d0") -> Tensor[(1, 2), float32] {
    nn.dense(%p01, %p11, units=None) /* ty=Tensor[(1, 2), float32] */
  } /* ty=fn (Tensor[(1, 2), float32], Tensor[(2, 2), float32]) -> Tensor[(1, 2), float32] */;
  %1 = %0(%x2, %w) /* ty=Tensor[(1, 2), float32] */;
  %2 = fn (%p0: Tensor[(1, 1, 2), float32] /* ty=Tensor[(1, 1, 2), float32] */, %p1: Tensor[(1, 2), float32] /* ty=Tensor[(1, 2), float32] */, Primitive=1, hash="f8991b5d265cd460") -> Tensor[(1, 1, 2), float32] {
    add(%p0, %p1) /* ty=Tensor[(1, 1, 2), float32] */
  } /* ty=fn (Tensor[(1, 1, 2), float32], Tensor[(1, 2), float32]) -> Tensor[(1, 1, 2), float32] */;
  %2(%x1, %1) /* ty=Tensor[(1, 1, 2), float32] */
}
```

In this case, the `dense` and the broadcasting `add` are NOT fused together. I do not know why they are fused in the original program. Perhaps there is a bug in the fusion algorithm. However, I am not familiar with the details of the operator fusion implementation, so I would appreciate some help in fully understanding this problem.
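For reference, the rank-3 output type in both programs is consistent with ordinary right-aligned broadcasting of a `(1, 1, 2)` operand against the `(1, 2)` result of `dense`. The sketch below is plain Python, not TVM code; `broadcast_shape` is a hypothetical helper, and it assumes NumPy-style broadcasting rules, which agree with the types printed in the Relay dumps.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Broadcast two static shapes, aligning dimensions from the right.

    Hypothetical helper assuming NumPy-style rules: two dimensions are
    compatible when they are equal or when one of them is 1.
    """
    out = []
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da != db and 1 not in (da, db):
            raise ValueError(f"cannot broadcast {a} with {b}")
        # Keep the dimension that is not the broadcast (size-1) one.
        out.append(da if db == 1 else db)
    return tuple(reversed(out))


dense_out = (1, 2)    # type of nn.dense in both programs
rank3_arg = (1, 1, 2)  # %x1 in the simpler program, or expand_dims(%0, axis=1)

result = broadcast_shape(rank3_arg, dense_out)
print(result)       # (1, 1, 2)
print(len(result))  # 3
```

So any group whose final operator is this `add` has a rank-3 output, while `dense` itself stays rank 2; the mismatch only becomes visible to the `dense_pack.x86` schedule when both end up in the same group.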
