This is an automated email from the ASF dual-hosted git repository.
ruihangl pushed a commit to branch unity
in repository https://gitbox.apache.org/repos/asf/tvm.git
The following commit(s) were added to refs/heads/unity by this push:
new 804ce09195 [Unity] Update CUTLASS Attention to incorporate upstream change (#15309)
804ce09195 is described below
commit 804ce09195b1421296dcb979815ba5033ea58d0d
Author: masahi <[email protected]>
AuthorDate: Fri Jul 14 03:19:45 2023 +0900
[Unity] Update CUTLASS Attention to incorporate upstream change (#15309)
https://github.com/NVIDIA/cutlass/pull/992
On RTX 4080, I measured a modest speedup on SD (49 -> 50 it/sec) and a tiny
speedup on Vicuna 7B (+0.2 - 0.3 tok/s).
Upstream recommends "setting blockSize to 64x128 rather than 32x128 after Sm75".
On RTX 4080 + SD there is no perf difference, but 64x128 is slightly slower
than 32x128 on the attention workload in the Vicuna decoder, so I didn't
change the block size.
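For reference, the kMaxK and alignment derivation added in the diff below can be sketched as follows. This is a hedged, self-contained sketch, not the exact TVM code path: `derive_attention_attrs` is a hypothetical helper, and `DataTypeSize` here is a minimal stand-in for TVM's dtype-to-bit-width mapping.

```python
# Minimal stand-in for TVM's DataTypeSize mapping (dtype name -> bit width).
DataTypeSize = {"float16": 16, "float32": 32}


def derive_attention_attrs(head_dim, head_dim_value, data_type="float16"):
    """Hypothetical helper mirroring the attrs set up in gen_tensor_op.py."""
    attrs = {}
    # Upstream CUTLASS replaced the kSingleValueIteration template flag
    # with kMaxK: the kernel is now specialized on the larger of the
    # Q/K head dim and the V head dim.
    attrs["kMaxK"] = max(int(head_dim), int(head_dim_value))
    # Aligned (vectorized) loads are only valid when both head dims
    # yield rows that are a multiple of 16 bytes.
    size = DataTypeSize[data_type]
    attrs["kIsAligned"] = (size * head_dim // 8) % 16 == 0 and (
        size * head_dim_value // 8
    ) % 16 == 0
    return attrs


# Example: fp16 with head_dim = head_dim_value = 128 -> kMaxK=128, aligned.
print(derive_attention_attrs(128, 128))
```

With fp16 and a head dim of 128, each row is 128 * 2 = 256 bytes, so the aligned kernel variant is selected and kMaxK is 128.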
---
3rdparty/cutlass | 2 +-
python/tvm/contrib/cutlass/attention_operation.py | 2 +-
python/tvm/contrib/cutlass/gen_tensor_op.py | 2 ++
3 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/3rdparty/cutlass b/3rdparty/cutlass
index f679663224..146d314057 160000
--- a/3rdparty/cutlass
+++ b/3rdparty/cutlass
@@ -1 +1 @@
-Subproject commit f679663224ef5a67c33dc94f89619128a53221c1
+Subproject commit 146d314057c5f193a70c2b36896e739c8c60aef4
diff --git a/python/tvm/contrib/cutlass/attention_operation.py b/python/tvm/contrib/cutlass/attention_operation.py
index 55c1ccd616..7240f24de4 100644
--- a/python/tvm/contrib/cutlass/attention_operation.py
+++ b/python/tvm/contrib/cutlass/attention_operation.py
@@ -88,7 +88,7 @@ def instantiate_attention_template(attrs):
/*is_aligned=*/${kIsAligned},
/*queries_per_block=*/${kQueriesPerBlock},
/*keys_per_block=*/${kKeysPerBlock},
- /*single_value_iteration=*/${kSingleValueIteration},
+ /*kMaxK=*/${kMaxK},
/*supports_dropout=*/${kSupportsDropout},
/*supports_bias=*/${kSupportsBias}
>;
diff --git a/python/tvm/contrib/cutlass/gen_tensor_op.py b/python/tvm/contrib/cutlass/gen_tensor_op.py
index 2988f9a8a2..0aaafe8505 100644
--- a/python/tvm/contrib/cutlass/gen_tensor_op.py
+++ b/python/tvm/contrib/cutlass/gen_tensor_op.py
@@ -729,6 +729,8 @@ def instantiate_template(func_name, annotations, func_args):
attrs["num_heads"] = n = annotations["num_heads"]
attrs["head_dim"] = h = annotations["head_dim"]
attrs["head_dim_value"] = h_v = annotations["head_dim_value"]
+ attrs["kMaxK"] = max(int(attrs["head_dim"]), int(attrs["head_dim_value"]))
+
data_type_size = DataTypeSize[data_type]
if (data_type_size * h // 8) % 16 == 0 and (data_type_size * h_v // 8) % 16 == 0:
attrs["kIsAligned"] = True