anwang2009 commented on code in PR #10450:
URL: https://github.com/apache/tvm/pull/10450#discussion_r848721337
##########
python/tvm/topi/cuda/tensorcore_alter_op.py:
##########
@@ -148,16 +148,18 @@ def _dense_legalize(attrs, inputs, arg_types):
# Pad input and output channels to use tensorcore schedule.
if dtype in ["float16", "int8", "uint8"]:
-        # The shape of (M, K, N) must be multiple of (16, 16, 16) or (32, 16, 8) or (8, 16, 32)
+ # The shape of (M, K, N) must be multiple of
+ # (16, 16, 16) or (32, 16, 8) or (8, 16, 32) or (4, 4, 4)
if (
(M % 8 == 0 and K % 16 == 0 and N % 32 == 0)
or (M % 16 == 0 and K % 16 == 0 and N % 16 == 0)
or (M % 32 == 0 and K % 16 == 0 and N % 8 == 0)
+ or (M % 4 == 0 and K % 4 == 0 and N % 4 == 0)
):
# no need to pad
return None
- candidates = [(16, 16, 16), (32, 16, 8), (8, 16, 32)]
+ candidates = [(16, 16, 16), (32, 16, 8), (8, 16, 32), (4, 4, 4)]
Review Comment:
PTAL @AndrewZhaoLuo @masahi
In particular, this allows tighter padding boxes in order to enable the ONNX CUDA
tests, where the shapes are on the order of (2, 3) x (3, 4); padding fails if the
padded boxes are not densely populated enough with real data. I imagine one
potential downside is that larger tensors might be computed faster in broad
sweeps with 16x16x16 padding than with the finer-grained 4x4x4, but I have no
concrete evidence either way.
wdyt, are we ok with adding the 4x4x4 padding target?
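For illustration, here is a minimal standalone sketch of the padding math this change affects (hypothetical helper names, not the actual TVM legalize code): each candidate tile rounds (M, K, N) up to its multiples, and a tighter tile like (4, 4, 4) wastes far fewer padded elements on tiny shapes.

```python
import math

def pad_extents(m, k, n, tile):
    # Round each of (m, k, n) up to a multiple of the tile dims.
    tm, tk, tn = tile
    return (
        math.ceil(m / tm) * tm,
        math.ceil(k / tk) * tk,
        math.ceil(n / tn) * tn,
    )

def best_candidate(m, k, n,
                   candidates=((16, 16, 16), (32, 16, 8), (8, 16, 32), (4, 4, 4))):
    # Pick the tile whose padded shape has the fewest total elements.
    return min(candidates, key=lambda t: math.prod(pad_extents(m, k, n, t)))

# A (2, 3) x (3, 4) matmul, i.e. (M, K, N) = (2, 3, 4):
# (4, 4, 4) pads to 4*4*4 = 64 elements, while every 16-based
# candidate pads to 4096, so the tight tile wins by a factor of 64.
```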
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]