This is an automated email from the ASF dual-hosted git repository.

ruihangl pushed a commit to branch unity
in repository https://gitbox.apache.org/repos/asf/tvm.git


The following commit(s) were added to refs/heads/unity by this push:
     new 804ce09195 [Unity] Update CUTLASS Attention to incorporate upstream change (#15309)
804ce09195 is described below

commit 804ce09195b1421296dcb979815ba5033ea58d0d
Author: masahi <[email protected]>
AuthorDate: Fri Jul 14 03:19:45 2023 +0900

    [Unity] Update CUTLASS Attention to incorporate upstream change (#15309)
    
    https://github.com/NVIDIA/cutlass/pull/992
    
    On RTX 4080, I got a modest speedup on SD (49 -> 50 it / sec) and a tiny speedup on Vicuna 7b (+0.2 - 0.3 tok / s).
    
    They recommend "setting blockSize to 64x128 rather than 32x128 after Sm75". On RTX 4080 + SD there is no perf difference, but 64x128 is slightly slower than 32x128 on the attention workload in the Vicuna decoder, so I didn't change the block size.
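
    Upstream's change replaces the boolean single_value_iteration template
    parameter with an explicit maximum-K dimension. As an illustrative
    sketch (not part of the patch; the shape values are hypothetical), the
    new parameter is simply the larger of the two head dimensions:

    ```python
    # Hypothetical attention shapes for illustration only.
    head_dim = 64         # K dimension of Q/K
    head_dim_value = 128  # K dimension of V

    # The fMHA kernel now receives the max K dimension directly, rather
    # than a single_value_iteration flag derived from it.
    kMaxK = max(head_dim, head_dim_value)
    ```

    With these example shapes, kMaxK is 128, which the kernel uses to pick
    its iteration strategy.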
---
 3rdparty/cutlass                                  | 2 +-
 python/tvm/contrib/cutlass/attention_operation.py | 2 +-
 python/tvm/contrib/cutlass/gen_tensor_op.py       | 2 ++
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/3rdparty/cutlass b/3rdparty/cutlass
index f679663224..146d314057 160000
--- a/3rdparty/cutlass
+++ b/3rdparty/cutlass
@@ -1 +1 @@
-Subproject commit f679663224ef5a67c33dc94f89619128a53221c1
+Subproject commit 146d314057c5f193a70c2b36896e739c8c60aef4
diff --git a/python/tvm/contrib/cutlass/attention_operation.py b/python/tvm/contrib/cutlass/attention_operation.py
index 55c1ccd616..7240f24de4 100644
--- a/python/tvm/contrib/cutlass/attention_operation.py
+++ b/python/tvm/contrib/cutlass/attention_operation.py
@@ -88,7 +88,7 @@ def instantiate_attention_template(attrs):
                       /*is_aligned=*/${kIsAligned},
                       /*queries_per_block=*/${kQueriesPerBlock},
                       /*keys_per_block=*/${kKeysPerBlock},
-                      /*single_value_iteration=*/${kSingleValueIteration},
+                      /*kMaxK=*/${kMaxK},
                       /*supports_dropout=*/${kSupportsDropout},
                       /*supports_bias=*/${kSupportsBias}
       >;
diff --git a/python/tvm/contrib/cutlass/gen_tensor_op.py b/python/tvm/contrib/cutlass/gen_tensor_op.py
index 2988f9a8a2..0aaafe8505 100644
--- a/python/tvm/contrib/cutlass/gen_tensor_op.py
+++ b/python/tvm/contrib/cutlass/gen_tensor_op.py
@@ -729,6 +729,8 @@ def instantiate_template(func_name, annotations, func_args):
         attrs["num_heads"] = n = annotations["num_heads"]
         attrs["head_dim"] = h = annotations["head_dim"]
         attrs["head_dim_value"] = h_v = annotations["head_dim_value"]
+        attrs["kMaxK"] = max(int(attrs["head_dim"]), int(attrs["head_dim_value"]))
+
         data_type_size = DataTypeSize[data_type]
         if (data_type_size * h // 8) % 16 == 0 and (data_type_size * h_v // 8) % 16 == 0:
             attrs["kIsAligned"] = True
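
The alignment predicate in the hunk above can be sketched standalone (the shapes and the dtype-size table entry are assumptions for illustration): a head dimension is "aligned" when its row size in bytes is a multiple of 16, the vectorized-access width CUTLASS expects.

```python
# Assumed subset of TVM's DataTypeSize table: bits per element.
DataTypeSize = {"float16": 16}

data_type = "float16"
h, h_v = 64, 64  # hypothetical head_dim and head_dim_value

data_type_size = DataTypeSize[data_type]
# Row size in bytes must be 16-byte aligned for both Q/K and V head dims.
is_aligned = (data_type_size * h // 8) % 16 == 0 and (data_type_size * h_v // 8) % 16 == 0
```

Here 16 bits * 64 elements is 128 bytes per row, so both checks pass and is_aligned is True.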
