DerrickYLJ commented on code in PR #18027:
URL: https://github.com/apache/tvm/pull/18027#discussion_r2119690490


##########
src/runtime/device_api.cc:
##########
@@ -111,7 +111,7 @@ size_t DeviceAPI::GetDataSize(const DLTensor& arr, Optional<String> mem_scope) {
     for (int i = 0; i < arr.ndim; ++i) {
       size *= static_cast<size_t>(arr.shape[i]);
     }
-    size *= (arr.dtype.bits * arr.dtype.lanes + 7) / 8;
+    size = (size * arr.dtype.bits * arr.dtype.lanes + 7) / 8;

Review Comment:
   Yes, this calculation should be correct for fp4 and fp6. It first computes the 
**total number of bits** and then takes the **ceiling in bytes**, so the result 
rounds up to the next full byte. 
   
   For example, 3 elements of fp4 (4 bits each, 1 lane) give 
`(3 * 4 * 1 + 7) / 8 = (12 + 7) / 8 = 19 / 8 = 2` bytes (integer division). 
This is correct: 12 bits need 2 bytes of storage, with the remaining 4 bits as 
padding. The same calculation applies to fp6. 
   
   The previous method calculated bytes per element (rounded up) and then 
multiplied by the element count, which over-allocates for packed sub-byte types 
(e.g., it would allocate 1 byte per fp4 element instead of packing two fp4 
elements into a single byte).
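   
   To make the difference concrete, here is a minimal standalone sketch of the 
two formulas. The helper names `PackedBytes` and `PerElementBytes` are 
hypothetical and are not part of TVM; they only mirror the arithmetic in the 
diff above, with the element count passed in directly instead of being derived 
from `arr.shape`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper mirroring the new formula:
// sum up all bits first, then round up once to whole bytes.
constexpr std::size_t PackedBytes(std::size_t num_elems, std::uint8_t bits,
                                  std::uint16_t lanes) {
  return (num_elems * bits * lanes + 7) / 8;
}

// Hypothetical helper mirroring the previous formula:
// round each element up to whole bytes, then multiply by the element count.
constexpr std::size_t PerElementBytes(std::size_t num_elems, std::uint8_t bits,
                                      std::uint16_t lanes) {
  return num_elems * ((static_cast<std::size_t>(bits) * lanes + 7) / 8);
}

// fp4, 3 elements: 12 bits pack into 2 bytes; the old formula gives 3.
static_assert(PackedBytes(3, 4, 1) == 2, "fp4 packed size");
static_assert(PerElementBytes(3, 4, 1) == 3, "fp4 per-element size");

// fp6, 4 elements: 24 bits pack into 3 bytes; the old formula gives 4.
static_assert(PackedBytes(4, 6, 1) == 3, "fp6 packed size");
static_assert(PerElementBytes(4, 6, 1) == 4, "fp6 per-element size");

// Byte-aligned types are unaffected: 3 float32 elements need 12 bytes either way.
static_assert(PackedBytes(3, 32, 1) == PerElementBytes(3, 32, 1), "float32 size");
```

   For byte-aligned dtypes both formulas agree, so the change only affects 
sub-byte types such as fp4 and fp6.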


