This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-14.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 1e417cb17486dc29be9f4c0a345d595a4b182830
Author: Rok Mihevc <[email protected]>
AuthorDate: Wed Oct 11 16:18:04 2023 +0200

    GH-24868: [C++] Add a Tensor logical value type with varying dimensions, implemented using ExtensionType (#37166)
    
    ### Rationale for this change
    
    For use cases where tensors share the same underlying data type and number
    of dimensions but differ in their actual shape, we want to add a
    `VariableShapeTensorType`.
    See https://github.com/apache/arrow/issues/24868 and
    https://github.com/huggingface/datasets/issues/5272
    
    ### What changes are included in this PR?
    
    This introduces the definition of the `arrow.variable_shape_tensor`
    extension, its C++ implementation, and a Python wrapper.
    
    ### Are these changes tested?
    
    Yes.
    
    ### Are there any user-facing changes?
    
    This introduces a new extension type to the user.
    * Closes: #24868
    
    Lead-authored-by: Rok Mihevc <[email protected]>
    Co-authored-by: Joris Van den Bossche <[email protected]>
    Co-authored-by: Antoine Pitrou <[email protected]>
    Signed-off-by: Joris Van den Bossche <[email protected]>
---
 docs/source/format/CanonicalExtensions.rst | 103 +++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/docs/source/format/CanonicalExtensions.rst b/docs/source/format/CanonicalExtensions.rst
index 9f7948cbfe..084b6e6289 100644
--- a/docs/source/format/CanonicalExtensions.rst
+++ b/docs/source/format/CanonicalExtensions.rst
@@ -148,6 +148,109 @@ Fixed shape tensor
   by this specification. Instead, this extension type lets one use fixed shape tensors
   as elements in a field of a RecordBatch or a Table.
 
+.. _variable_shape_tensor_extension:
+
+Variable shape tensor
+=====================
+
+* Extension name: `arrow.variable_shape_tensor`.
+
+* The storage type of the extension is: ``StructArray`` where struct
+  is composed of **data** and **shape** fields describing a single
+  tensor per row:
+
+  * **data** is a ``List`` holding tensor elements (each list element is
+    a single tensor). The List's value type is the value type of the tensor,
+    such as an integer or floating-point type.
+  * **shape** is a ``FixedSizeList<int32>[ndim]`` of the tensor shape where
+    the size of the list ``ndim`` is equal to the number of dimensions of the
+    tensor.
+
+* Extension type parameters:
+
+  * **value_type** = the Arrow data type of individual tensor elements.
+
+  Optional parameters describing the logical layout:
+
+  * **dim_names** = explicit names of the tensor dimensions,
+    given as an array. Its length should be equal to the shape
+    length, that is, to the number of dimensions.
+
+    ``dim_names`` can be used if the dimensions have well-known
+    names and they map to the physical layout (row-major).
+
+  * **permutation**  = indices of the desired ordering of the
+    original dimensions, defined as an array.
+
+    The indices contain a permutation of the values [0, 1, .., N-1] where
+    N is the number of dimensions. The permutation indicates which
+    dimension of the logical layout corresponds to which dimension of the
+    physical tensor (the i-th dimension of the logical view corresponds
+    to the dimension with number ``permutation[i]`` of the physical tensor).
+
+    Permutation can be useful in case the logical order of
+    the tensor is a permutation of the physical order (row-major).
+
+    When logical and physical layout are equal, the permutation will always
+    be ([0, 1, .., N-1]) and can therefore be left out.
+
+  * **uniform_shape** = sizes of each tensor's dimensions, which are
+    guaranteed to stay constant in uniform dimensions and can vary in
+    non-uniform dimensions. This holds over all tensors in the array.
+    Sizes in uniform dimensions are represented with int32 values, while
+    sizes of the non-uniform dimensions are not known in advance and are
+    represented with null. If ``uniform_shape`` is not provided it is assumed
+    that all dimensions are non-uniform.
+    An array containing a tensor with shape (2, 3, 4) and whose first and
+    last dimensions are uniform would have ``uniform_shape`` (2, null, 4).
+    This allows for interpreting the tensor correctly without accounting for
+    uniform dimensions while still permitting optional optimizations that
+    take advantage of the uniformity.
+
+* Description of the serialization:
+
+  The metadata must be a valid JSON object that optionally includes
+  dimension names with keys **"dim_names"** and  ordering of dimensions
+  with key **"permutation"**.
+  Shapes of tensors can be defined in a subset of dimensions by providing
+  key **"uniform_shape"**.
+  Minimal metadata is an empty string.
+
+  - Example with ``dim_names`` metadata for NCHW ordered data (note that the first
+    logical dimension, ``N``, is mapped to the **data** List array: each element in the List
+    is a CHW tensor and the List of tensors implicitly constitutes a single NCHW tensor):
+
+    ``{ "dim_names": ["C", "H", "W"] }``
+
+  - Example with ``uniform_shape`` metadata for a set of color images
+    with fixed height, variable width and three color channels:
+
+    ``{ "dim_names": ["H", "W", "C"], "uniform_shape": [400, null, 3] }``
+
+  - Example of permuted 3-dimensional tensor:
+
+    ``{ "permutation": [2, 0, 1] }``
+
+    For example, if the physical **shape** of an individual tensor
+    is ``[100, 200, 500]``, this permutation would denote a logical shape
+    of ``[500, 100, 200]``.
+
+.. note::
+
+  With the exception of ``permutation``, the parameters and storage
+  of VariableShapeTensor relate to the *physical* storage of the tensor.
+
+  For example, consider a tensor with::
+    shape = [10, 20, 30]
+    dim_names = [x, y, z]
+    permutation = [2, 0, 1]
+
+  This means the logical tensor has names [z, x, y] and shape [30, 10, 20].
+
+.. note::
+   Values inside each **data** tensor element are stored in row-major/C-contiguous
+   order according to the corresponding **shape**.
+
 =========================
 Community Extension Types
 =========================
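
The `permutation` and `uniform_shape` semantics described in the added text can be illustrated with a minimal Python sketch. It is not part of the committed change and does not use the Arrow API: the helper names `logical_view` and `matches_uniform_shape` are hypothetical, only the standard library is used, and an empty metadata string is assumed to mean "no optional parameters".

```python
import json


def logical_view(physical_shape, metadata_json=""):
    """Return (logical_shape, logical_dim_names) for a single tensor.

    Hypothetical helper: interprets the optional "permutation" and
    "dim_names" keys of the extension metadata as described in the text.
    """
    # Assumption: an empty string is treated as metadata with no keys.
    meta = json.loads(metadata_json) if metadata_json else {}
    ndim = len(physical_shape)

    # A missing permutation means logical and physical layouts are equal.
    perm = meta.get("permutation", list(range(ndim)))
    if sorted(perm) != list(range(ndim)):
        raise ValueError("permutation must be a permutation of [0, ..., N-1]")

    names = meta.get("dim_names")
    if names is not None and len(names) != ndim:
        raise ValueError("dim_names length must equal the number of dimensions")

    # The i-th logical dimension is physical dimension perm[i].
    logical_shape = [physical_shape[i] for i in perm]
    logical_names = [names[i] for i in perm] if names is not None else None
    return logical_shape, logical_names


def matches_uniform_shape(physical_shape, uniform_shape=None):
    """Check one tensor's shape against uniform_shape (None = non-uniform)."""
    if uniform_shape is None:
        return True  # all dimensions are assumed to be non-uniform
    if len(uniform_shape) != len(physical_shape):
        return False
    return all(u is None or u == s
               for u, s in zip(uniform_shape, physical_shape))


if __name__ == "__main__":
    meta = json.dumps({"dim_names": ["x", "y", "z"], "permutation": [2, 0, 1]})
    print(logical_view([10, 20, 30], meta))
    print(matches_uniform_shape([400, 640, 3], [400, None, 3]))
```

With the values from the note in the diff (shape `[10, 20, 30]`, names `[x, y, z]`, permutation `[2, 0, 1]`), `logical_view` yields the logical shape `[30, 10, 20]` and names `[z, x, y]`, matching the documented result.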
