anwang2009 commented on code in PR #10450:
URL: https://github.com/apache/tvm/pull/10450#discussion_r848721337


##########
python/tvm/topi/cuda/tensorcore_alter_op.py:
##########
@@ -148,16 +148,18 @@ def _dense_legalize(attrs, inputs, arg_types):
 
     # Pad input and output channels to use tensorcore schedule.
     if dtype in ["float16", "int8", "uint8"]:
-        # The shape of (M, K, N) must be multiple of (16, 16, 16) or (32, 16, 8) or (8, 16, 32)
+        # The shape of (M, K, N) must be multiple of
+        # (16, 16, 16) or (32, 16, 8) or (8, 16, 32) or (4, 4, 4)
         if (
             (M % 8 == 0 and K % 16 == 0 and N % 32 == 0)
             or (M % 16 == 0 and K % 16 == 0 and N % 16 == 0)
             or (M % 32 == 0 and K % 16 == 0 and N % 8 == 0)
+            or (M % 4 == 0 and K % 4 == 0 and N % 4 == 0)
         ):
             # no need to pad
             return None
 
-        candidates = [(16, 16, 16), (32, 16, 8), (8, 16, 32)]
+        candidates = [(16, 16, 16), (32, 16, 8), (8, 16, 32), (4, 4, 4)]

Review Comment:
   PTAL @AndrewZhaoLuo @masahi 
   
   In particular, this allows tighter padding boxes in order to enable the ONNX CUDA tests, where the shapes are on the order of (2, 3) x (3, 4); padding fails if the padding boxes are not densely populated enough with real data. One potential downside is that larger tensors might be computed faster in broad sweeps with 16x16x16 padding than with the finer-grained 4x4x4, but I have no concrete evidence either way.
    
   wdyt, are we ok with adding the 4x4x4 padding target?
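
   For intuition, here's a small standalone sketch (plain Python, not TVM code; `padded_shape` and `padding_overhead` are hypothetical helper names) illustrating why the (4, 4, 4) candidate matters for a tiny (2, 3) x (3, 4) matmul: the existing tensor-core tiles pad such shapes to many times their real size, while 4x4x4 keeps the padded tensors densely populated with real data.

   ```python
   import math

   def padded_shape(m, k, n, tile):
       """Round (M, K, N) up to the nearest multiples of the candidate tile."""
       tm, tk, tn = tile
       return (math.ceil(m / tm) * tm,
               math.ceil(k / tk) * tk,
               math.ceil(n / tn) * tn)

   def padding_overhead(m, k, n, tile):
       """Ratio of padded elements to real elements across A (MxK) and B (KxN)."""
       pm, pk, pn = padded_shape(m, k, n, tile)
       real = m * k + k * n
       padded = pm * pk + pk * pn
       return padded / real

   # Same candidate list as the diff above, including the proposed (4, 4, 4).
   candidates = [(16, 16, 16), (32, 16, 8), (8, 16, 32), (4, 4, 4)]
   for tile in candidates:
       print(tile, padded_shape(2, 3, 4, tile),
             round(padding_overhead(2, 3, 4, tile), 1))
   ```

   For (M, K, N) = (2, 3, 4), the 16x16x16 candidate pads 18 real elements out to 512 (roughly 28x overhead), while 4x4x4 pads to 32 (under 2x) — which is the "densely populated" property the comment refers to.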


