elvin-n commented on code in PR #13621:
URL: https://github.com/apache/tvm/pull/13621#discussion_r1058100234
##########
python/tvm/relay/op/strategy/x86.py:
##########
@@ -627,16 +627,16 @@ def batch_matmul_strategy_cpu(attrs, inputs, out_type, target):
if (
not attrs.transpose_a
and attrs.transpose_b
- and target_has_vnni(mcpu)
+ and target_has_avx512(mcpu)
and inputs[0].dtype == "uint8"
and inputs[1].dtype == "int8"
and inputs[1].shape[-2] % 16 == 0
and inputs[1].shape[-1] % 4 == 0
):
strategy.add_implementation(
- wrap_compute_batch_matmul(topi.x86.batch_matmul_vnni_compute, need_out_dtype=True),
- wrap_topi_schedule(topi.x86.schedule_batch_matmul_vnni),
- name="batch_matmul_vnni.x86",
Review Comment:
I would distinguish AMX from VNNI, AVX512, AVX2, and SSE3 (by the way, SSE2
has no int8 support; if I am not mistaken, the required instructions appeared
in SSE3.x), because the first is a matrix-multiplication extension while the
others are vector instructions. For now I propose going from local to generic:
when we see a need to differentiate between the vector instruction sets, we
will do so. At the moment the pattern looks similar for all of the vector
instructions. Blocking should be added separately if it is not done yet. The
lanes aspect of the TVM intrinsic should be covered in this PR.
> match inner loops to these varying sizes.
The inner loop is the same for all of these instructions:
```
// For each output lane i: a 4-element dot product accumulated into output[i].
for (int k = 0; k < 4; k++) {
    output[i] += data[k] * kernel[i][k];
}
```
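To make the lanes point concrete, here is a minimal pure-Python sketch of the reference semantics of that loop. The names `dot_4x_lanes` and `INT32_LANES` are hypothetical and are not TVM APIs; the lane counts are simply the register width divided by 32 bits, assuming int32 accumulators.

```python
# Illustrative only: the function name and lane table are assumptions, not TVM APIs.
# Number of int32 accumulator lanes per vector register (register bits / 32).
INT32_LANES = {"sse": 4, "avx2": 8, "avx512": 16}


def dot_4x_lanes(data, kernel, output):
    """Reference semantics: output[i] += sum over k of data[k] * kernel[i][k]."""
    assert len(data) == 4, "each lane reduces over 4 values"
    assert len(kernel) == len(output), "one kernel row per output lane"
    for i in range(len(output)):  # lanes: the only part that varies per ISA
        for k in range(4):        # reduction: identical for every ISA
            output[i] += data[k] * kernel[i][k]
    return output


lanes = INT32_LANES["avx512"]
data = [1, 2, 3, 4]                       # stands in for uint8 activations
kernel = [[i] * 4 for i in range(lanes)]  # stands in for int8 weights
result = dot_4x_lanes(data, kernel, [0] * lanes)
```

In this sketch only the outer lane count changes between instruction sets; the 4-element reduction is shared, which is why a single pattern could plausibly cover all of the vector variants.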
> TVM is a compiler after all, to my knowledge the only capable of
> auto-tensorization with arbitrary intrinsic.
I agree. At the same time, I propose moving from local to generic patterns.
We are not limiting anything for now.