jorisvandenbossche commented on code in PR #33925: URL: https://github.com/apache/arrow/pull/33925#discussion_r1095736003
##########
docs/source/format/CanonicalExtensions.rst:
##########

@@ -72,4 +72,30 @@ same rules as laid out above, and provide backwards compatibility guarantees.

 Official List
 =============

-No canonical extension types have been standardized yet.
+Fixed shape tensor
+==================
+
+* Extension name: `arrow.fixed_shape_tensor`.
+
+* The storage type of the extension: ``FixedSizeList`` where:
+
+  * **value_type** is the data type of individual tensors and
+    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
+  * **list_size** is the product of all the elements in tensor shape.
+
+* Extension type parameters:
+
+  * **value_type** = Arrow DataType of the tensor elements
+  * **shape** = shape of the contained tensors as a tuple
+  * **is_row_major** = boolean indicating the order of elements

Review Comment:

OK, I understand (we might have been talking past each other a bit, as I was assuming you want to have `strides` to allow zero-copy for all cases, while I tried to convince you that it's not needed).

It's certainly true that we _could_ store strides, but I am not sure it would be a better generalization of (or a full replacement for) dimension names. Consider for example that you have channels-last physical data (NHWC), but you view it as channels-first logically (NCHW). To store the data with the logical dimension order, this would require a `strides` parameter. But assume you only store the strides in the `FixedShapeTensor` type and not the dimension names: then when consuming that data, you know the strides associated with it, but you still don't know for sure what the dimensions mean (because both NHWC viewed as NCHW, and NCHW viewed as NHWC, would give you custom strides).
Of course, if you know where the data is coming from and you know that it's from a pytorch context, then you can assume that the logical order is NCHW (that's how pytorch [always shows it](https://pytorch.org/blog/accelerating-pytorch-vision-models-with-channels-last-on-cpu/): "No matter what the physical order is, tensor shape and stride will always be depicted in the order of NCHW"), and that information combined with the strides ensures you know whether the physical layout is channels-first or channels-last. But that requires application-specific context.

If instead you store the dimension names that match the order assuming a row-major layout, then you can infer the same information (and how to transpose it to get your desired logical order) without requiring this application-specific knowledge (assuming different applications use consistent dimension names, so you can recognize those). So my current understanding is that dimension names are the more generalizable information.

In addition, pushing the strides logic (how to translate the given dimension order and your desired dimension order into strides) to the application keeps the implementation of the FixedShapeTensorType itself simpler, not requiring every implementation to deal with custom strides.
