[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #33925: GH-33923: [Docs] Tensor canonical extension type specification

via GitHub Wed, 15 Feb 2023 05:26:29 -0800


jorisvandenbossche commented on code in PR #33925:
URL: https://github.com/apache/arrow/pull/33925#discussion_r1107113880



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -72,4 +72,65 @@ same rules as laid out above, and provide backwards compatibility guarantees.
 Official List
 =============
 
-No canonical extension types have been standardized yet.
+Fixed shape tensor
+==================
+
+* Extension name: `arrow.fixed_shape_tensor`.
+
+* The storage type of the extension: ``FixedSizeList`` where:
+
+  * **value_type** is the data type of individual tensors and
+    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
+  * **list_size** is the product of all the elements in tensor shape.
+
+* Extension type parameters:
+
+  * **value_type** = Arrow DataType of the tensor elements
+  * **shape** = shape of the contained tensors as an array
+
+  Optional parameters:
+
+  * **dim_names** = explicit names to tensor dimensions
+    as an array. The length of it should be equal to the shape
+    length and equal to the number of dimensions.
+
+    ``dim_names`` are used if the dimensions have well-known

Review Comment:
   ```suggestion
       ``dim_names`` can be used if the dimensions have well-known
   ```



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -72,4 +72,65 @@ same rules as laid out above, and provide backwards compatibility guarantees.
 Official List
 =============
 
-No canonical extension types have been standardized yet.
+Fixed shape tensor
+==================
+
+* Extension name: `arrow.fixed_shape_tensor`.
+
+* The storage type of the extension: ``FixedSizeList`` where:
+
+  * **value_type** is the data type of individual tensors and
+    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
+  * **list_size** is the product of all the elements in tensor shape.
+
+* Extension type parameters:
+
+  * **value_type** = Arrow DataType of the tensor elements
+  * **shape** = shape of the contained tensors as an array
+
+  Optional parameters:
+
+  * **dim_names** = explicit names to tensor dimensions
+    as an array. The length of it should be equal to the shape
+    length and equal to the number of dimensions.
+
+    ``dim_names`` are used if the dimensions have well-known
+    names and they map to the physical layout (row-major).
+
+  * **permutation**  = indices of the desired ordering of the
+    original dimensions, defined as an array.
+
+    The indices contain a permutation of the values [0, 1, .., N-1] where
+    N is the number of dimensions. The permutation indicates which
+    dimension of the logical layout corresponds to which dimension of the
+    physical tensor (the i-th dimension of the logical view corresponds
+    to the dimension with number permutations[i] of the physical tensor).

Review Comment:
   ```suggestion
       to the dimension with number ``permutations[i]`` of the physical tensor).
   ```
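
To make the quoted mapping concrete, here is a minimal sketch in plain Python. It assumes the reading given in the text above (the i-th logical dimension is physical dimension ``permutation[i]``); the helper name `logical_shape` is hypothetical and not part of pyarrow or the spec.

```python
def logical_shape(shape, permutation=None):
    """Return the logical shape for a physical (row-major) shape.

    Hypothetical helper: the i-th logical dimension is taken to be
    physical dimension permutation[i], as described in the quoted text.
    """
    if permutation is None:
        # No permutation given: logical layout equals physical layout.
        return list(shape)
    if sorted(permutation) != list(range(len(shape))):
        raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")
    return [shape[p] for p in permutation]


print(logical_shape([100, 200, 500], [2, 0, 1]))  # [500, 100, 200]
```

With the example metadata from the diff (``"shape": [100, 200, 500]``, ``"permutation": [2, 0, 1]``), this reading gives a logical shape of ``[500, 100, 200]``.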



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -72,4 +72,65 @@ same rules as laid out above, and provide backwards compatibility guarantees.
 Official List
 =============
 
-No canonical extension types have been standardized yet.
+Fixed shape tensor
+==================
+
+* Extension name: `arrow.fixed_shape_tensor`.
+
+* The storage type of the extension: ``FixedSizeList`` where:
+
+  * **value_type** is the data type of individual tensors and
+    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
+  * **list_size** is the product of all the elements in tensor shape.
+
+* Extension type parameters:
+
+  * **value_type** = Arrow DataType of the tensor elements
+  * **shape** = shape of the contained tensors as an array
+
+  Optional parameters:
+
+  * **dim_names** = explicit names to tensor dimensions
+    as an array. The length of it should be equal to the shape
+    length and equal to the number of dimensions.
+
+    ``dim_names`` are used if the dimensions have well-known
+    names and they map to the physical layout (row-major).
+
+  * **permutation**  = indices of the desired ordering of the
+    original dimensions, defined as an array.
+
+    The indices contain a permutation of the values [0, 1, .., N-1] where
+    N is the number of dimensions. The permutation indicates which
+    dimension of the logical layout corresponds to which dimension of the
+    physical tensor (the i-th dimension of the logical view corresponds
+    to the dimension with number permutations[i] of the physical tensor).
+
+    **Permutation is only needed in case the logical order of
+    the tensor is a permutation of the physical order (row-major).**
+
+    When logical and physical layout are equal, the permutation will always
+    be ([0, 1, .., N-1]) and is therefore absent. Same holds the other way

Review Comment:
   ```suggestion
       be ([0, 1, .., N-1]) and can therefore be left out. Same holds the other way
   ```
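
One way a writer could act on "can therefore be left out" is sketched below. This is only an illustration under that assumption; `normalize_permutation` is a hypothetical name, not an Arrow API.

```python
def normalize_permutation(permutation, ndim):
    """Return None for an absent or identity permutation, else a list copy.

    Hypothetical helper: an identity permutation [0, 1, .., N-1] carries
    no information, so a writer may omit it from the metadata.
    """
    if permutation is None:
        return None
    permutation = list(permutation)
    identity = list(range(ndim))
    if sorted(permutation) != identity:
        raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")
    if permutation == identity:
        # Logical and physical layouts match: nothing to record.
        return None
    return permutation


print(normalize_permutation([0, 1, 2], 3))  # None
print(normalize_permutation([2, 0, 1], 3))  # [2, 0, 1]
```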



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -72,4 +72,65 @@ same rules as laid out above, and provide backwards compatibility guarantees.
 Official List
 =============
 
-No canonical extension types have been standardized yet.
+Fixed shape tensor
+==================
+
+* Extension name: `arrow.fixed_shape_tensor`.
+
+* The storage type of the extension: ``FixedSizeList`` where:
+
+  * **value_type** is the data type of individual tensors and
+    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
+  * **list_size** is the product of all the elements in tensor shape.
+
+* Extension type parameters:
+
+  * **value_type** = Arrow DataType of the tensor elements
+  * **shape** = shape of the contained tensors as an array
+
+  Optional parameters:
+
+  * **dim_names** = explicit names to tensor dimensions
+    as an array. The length of it should be equal to the shape
+    length and equal to the number of dimensions.
+
+    ``dim_names`` are used if the dimensions have well-known
+    names and they map to the physical layout (row-major).
+
+  * **permutation**  = indices of the desired ordering of the
+    original dimensions, defined as an array.
+
+    The indices contain a permutation of the values [0, 1, .., N-1] where
+    N is the number of dimensions. The permutation indicates which
+    dimension of the logical layout corresponds to which dimension of the
+    physical tensor (the i-th dimension of the logical view corresponds
+    to the dimension with number permutations[i] of the physical tensor).
+
+    **Permutation is only needed in case the logical order of
+    the tensor is a permutation of the physical order (row-major).**
+
+    When logical and physical layout are equal, the permutation will always
+    be ([0, 1, .., N-1]) and is therefore absent. Same holds the other way
+    round: if permutation parameter is absent, it is assumed that logical
+    layout matches the physical one.
+
+* Description of the serialization:
+
+  The metadata must be a valid JSON object including shape of
+  the contained tensors as an array with key **"shape"** plus optional
+  dimension names with keys **"dim_names"** and ordering of the
+  dimensions with key **"permutation"**.
+
+  - Example: ``{ "shape": [2, 5]}``
+  - Example with ``dim_names`` metadata for NCHW ordered data:
+
+    ``{ "shape": [100, 200, 500], "dim_names": ["C", "H", "W"]}``
+
+  - Example of permuted 3-dimensional tensor:
+
+    ``{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}``
+
+.. note::
+
+  Elements in an fixed shape tensor extension array are stored

Review Comment:
   ```suggestion
     Elements in a fixed shape tensor extension array are stored
   ```
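
For readers following the serialization part of this hunk, here is a rough sketch using only the Python standard library. It builds the JSON metadata with the three keys named in the text and computes ``list_size`` as the product of the shape. The function names `build_metadata` and `list_size` are assumptions made for illustration, and the validation rules are simply the length and permutation constraints stated in the diff.

```python
import json
import math


def build_metadata(shape, dim_names=None, permutation=None):
    """Serialize the extension type parameters to a JSON object string.

    Hypothetical helper: "shape" is required; "dim_names" and
    "permutation" are written only when provided.
    """
    ndim = len(shape)
    metadata = {"shape": list(shape)}
    if dim_names is not None:
        if len(dim_names) != ndim:
            raise ValueError("dim_names must have one entry per dimension")
        metadata["dim_names"] = list(dim_names)
    if permutation is not None:
        if sorted(permutation) != list(range(ndim)):
            raise ValueError("permutation must be a permutation of [0, 1, .., N-1]")
        metadata["permutation"] = list(permutation)
    return json.dumps(metadata)


def list_size(shape):
    """Size of the FixedSizeList storage: the product of the shape."""
    return math.prod(shape)


print(build_metadata([100, 200, 500], dim_names=["C", "H", "W"]))
print(list_size([2, 5]))  # 10
```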



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -72,4 +72,65 @@ same rules as laid out above, and provide backwards compatibility guarantees.
 Official List
 =============
 
-No canonical extension types have been standardized yet.
+Fixed shape tensor
+==================
+
+* Extension name: `arrow.fixed_shape_tensor`.
+
+* The storage type of the extension: ``FixedSizeList`` where:
+
+  * **value_type** is the data type of individual tensors and
+    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
+  * **list_size** is the product of all the elements in tensor shape.
+
+* Extension type parameters:
+
+  * **value_type** = Arrow DataType of the tensor elements
+  * **shape** = shape of the contained tensors as an array
+
+  Optional parameters:
+
+  * **dim_names** = explicit names to tensor dimensions
+    as an array. The length of it should be equal to the shape
+    length and equal to the number of dimensions.
+
+    ``dim_names`` are used if the dimensions have well-known
+    names and they map to the physical layout (row-major).
+
+  * **permutation**  = indices of the desired ordering of the
+    original dimensions, defined as an array.
+
+    The indices contain a permutation of the values [0, 1, .., N-1] where
+    N is the number of dimensions. The permutation indicates which
+    dimension of the logical layout corresponds to which dimension of the
+    physical tensor (the i-th dimension of the logical view corresponds
+    to the dimension with number permutations[i] of the physical tensor).
+
+    **Permutation is only needed in case the logical order of
+    the tensor is a permutation of the physical order (row-major).**
+
+    When logical and physical layout are equal, the permutation will always
+    be ([0, 1, .., N-1]) and is therefore absent. Same holds the other way
+    round: if permutation parameter is absent, it is assumed that logical
+    layout matches the physical one.

Review Comment:
   I think this last sentence is a bit redundant with the previous one?



##########
docs/source/format/CanonicalExtensions.rst:
##########
@@ -72,4 +72,65 @@ same rules as laid out above, and provide backwards compatibility guarantees.
 Official List
 =============
 
-No canonical extension types have been standardized yet.
+Fixed shape tensor
+==================
+
+* Extension name: `arrow.fixed_shape_tensor`.
+
+* The storage type of the extension: ``FixedSizeList`` where:
+
+  * **value_type** is the data type of individual tensors and
+    is an instance of ``pyarrow.DataType`` or ``pyarrow.Field``.
+  * **list_size** is the product of all the elements in tensor shape.
+
+* Extension type parameters:
+
+  * **value_type** = Arrow DataType of the tensor elements
+  * **shape** = shape of the contained tensors as an array
+
+  Optional parameters:
+
+  * **dim_names** = explicit names to tensor dimensions
+    as an array. The length of it should be equal to the shape
+    length and equal to the number of dimensions.
+
+    ``dim_names`` are used if the dimensions have well-known
+    names and they map to the physical layout (row-major).
+
+  * **permutation**  = indices of the desired ordering of the
+    original dimensions, defined as an array.
+
+    The indices contain a permutation of the values [0, 1, .., N-1] where
+    N is the number of dimensions. The permutation indicates which
+    dimension of the logical layout corresponds to which dimension of the
+    physical tensor (the i-th dimension of the logical view corresponds
+    to the dimension with number permutations[i] of the physical tensor).
+
+    **Permutation is only needed in case the logical order of
+    the tensor is a permutation of the physical order (row-major).**

Review Comment:
   ```suggestion
       Permutation can be useful in case the logical order of
       the tensor is a permutation of the physical order (row-major).
   ```
   
   I would maybe use the softer "can be useful" instead of "is needed", because
   even in that case it's not strictly "needed": you are perfectly allowed to
   just store the data in the physical layout, or e.g. use `dim_names` to
   convey this information.
   
   Nitpick: elsewhere only keys are put in bold, so I also wouldn't put this
   in bold (it's not the most important aspect of the whole spec).


