yzh119 commented on code in PR #15064:
URL: https://github.com/apache/tvm/pull/15064#discussion_r1225188095


##########
src/runtime/relax_vm/lm_support.cc:
##########
@@ -167,12 +168,56 @@ class AttentionKVCache : public ObjectRef {
 
 TVM_REGISTER_OBJECT_TYPE(AttentionKVCacheObj);
 
+/*!
+ * \brief Create multiple KV caches with the same shape from a single memory allocation.
+ * \param init_data The initial data to put into the cache. Ignored if init_fill_count is
+ *        less than 0.
+ * \param reserve_shape The shape of the cache.
+ * \param init_fill_count The initial number of rows to fill into the cache.
+ * \param num_caches Number of caches to create.
+ */
+Array<AttentionKVCache> CreateMultipleKVCaches(NDArray init_data, ShapeTuple reserve_shape,
+                                               int init_fill_count, int num_caches) {
+  DLDataType dtype = init_data->dtype;
+
+  int64_t cache_size = (dtype.bits * dtype.lanes + 7) / 8;

Review Comment:
   So currently, if the dtype is smaller than one byte, we pad it up to one byte, is that correct?
   FYI: [FlexGen](https://arxiv.org/abs/2303.06865) uses a 4-bit KV cache; we could support that later.
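
   To make the padding behavior concrete, here is a small sketch of the rounding in `(dtype.bits * dtype.lanes + 7) / 8`. Note that `DTypeBits` below is a hypothetical stand-in for the relevant `DLDataType` fields, not the actual DLPack struct:

   ```cpp
   #include <cassert>
   #include <cstdint>

   // Hypothetical stand-in for DLDataType's bit-width fields,
   // used only to illustrate the rounding in the diff above.
   struct DTypeBits {
     int bits;   // bits per lane, e.g. 4 for a 4-bit KV cache
     int lanes;  // vector lanes, usually 1
   };

   // Bytes occupied by one element: sub-byte dtypes are padded up
   // to a whole byte, matching (dtype.bits * dtype.lanes + 7) / 8.
   int64_t ElementBytes(DTypeBits dtype) {
     return (static_cast<int64_t>(dtype.bits) * dtype.lanes + 7) / 8;
   }

   int main() {
     assert(ElementBytes({8, 1}) == 1);   // int8: exactly one byte
     assert(ElementBytes({16, 1}) == 2);  // fp16: two bytes
     assert(ElementBytes({4, 1}) == 1);   // 4-bit: padded up to one byte
     assert(ElementBytes({4, 2}) == 1);   // 4-bit x 2 lanes: fits in one byte
     return 0;
   }
   ```

   So a true 4-bit KV cache would need packing logic beyond this rounding, since two 4-bit values per byte cannot be expressed with one-byte-per-element storage.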



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to