Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-24 Thread Rok Mihevc
Thanks for all the input and feedback so far!
I went ahead and added the uniform_shape parameter, using the second
notation option, to the proposal [1].

More discussion could be valuable here, so I'd like to call the vote on
Wednesday.

[1]
https://github.com/apache/arrow/pull/37166/files/9c827a0ba54280f4695202e17e32902986c4f12f#diff-b54425cb176b53e51925c13a4d4e85cf7d03d4e1226e6d5bf4d7ae09923db8b3

Best,
Rok

On Sat, Sep 16, 2023 at 3:11 PM Rok Mihevc  wrote:

> I agree, the increased complexity is probably not worth the savings
> from keeping only shapes of ragged dimensions.
> However I still think it could be valuable to optionally store uniform
> dimension shape as metadata. So:
>
> **uniform_shape** = Sizes of all contained tensors in their uniform
> dimensions.
>
> Given we have a series of 3-dimensional image tensors [H, W, C] with
> variable width (uniform_dimensions = [0, 2]):
>
> Notation option 1:
> **uniform_shape** = [400,3]
>
> Notation option 2:
> **uniform_shape** = [400,0,3]
>
> Best,
> Rok
>
> On Sat, Sep 16, 2023 at 4:34 AM Jeremy Leibs  wrote:
> >
> > On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc  wrote:
> >
> > >
> > > How about also changing shape and adding uniform_shape like so:
> > > """
> > > **shape** is a ``FixedSizeList[ndim_ragged]`` of the ragged shape
> > > of each tensor contained in ``data``, where the size of the list
> > > ``ndim_ragged`` equals the number of ragged dimensions, i.e. the
> > > total number of dimensions minus the number of uniform dimensions.
> > > [..]
> > > **uniform_shape**
> > > Sizes of all contained tensors in their uniform dimensions.
> > > """
> > >
> > > This would make the shape array smaller (in width) if more uniform
> > > dimensions were provided. However, it would increase the complexity of
> > > the extension type a little.
> > >
> > >
> > This trade-off doesn't seem worthwhile to me.
> >  - The shape array will almost always be dramatically smaller than the
> > tensor data itself, so the space savings are unlikely to be meaningful
> > in practice.
> >  - On the other hand, coding up the index offset math for a sparsely
> > represented shape with implicitly interleaved uniform dimensions is much
> > more error prone (and less efficient).
> >  - Even just consider answering a simple question like "What is the size
> > of dimension N":
> >
> > If `shape` always contains all the dimensions, this is trivially
> > `shape[N]` (or `shape[permutations[N]]` if permutations was specified).
> >
> > On the other hand, if `shape` only contains the ragged/variable
> > dimensions, this lookup instead becomes something like:
> > ```
> > offset = count(uniform_dimensions < N)
> > shape[N - offset]
> > ```
> >
> > Maybe this doesn't seem too bad at first, but does everyone implement
> > this as count()? Does someone implement it as
> > `find_lower_bound(uniform_dimension, N)`? Did they validate that
> > `uniform_dimensions` was specified as a sorted list?
> >
> > Now for added risk of errors, consider how this interacts with the
> > `permutation`... in my opinion there is way too much thinking required to
> > figure out whether the correct value is `shape[permutations[N] - offset]`
> > or `shape[permutations[N - offset]]`.
> >
> > Arrow design guidance typically skews heavily in favor of efficient
> > deterministic access over maximally space-efficient representations.
> >
> > Best,
> > Jeremy
>


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-16 Thread Rok Mihevc
I agree, the increased complexity is probably not worth the savings
from keeping only shapes of ragged dimensions.
However I still think it could be valuable to optionally store uniform
dimension shape as metadata. So:

**uniform_shape** = Sizes of all contained tensors in their uniform dimensions.

Given we have a series of 3-dimensional image tensors [H, W, C] with
variable width (uniform_dimensions = [0, 2]):

Notation option 1:
**uniform_shape** = [400,3]

Notation option 2:
**uniform_shape** = [400,0,3]
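To make option 2 concrete: a 0 marks a ragged dimension. A small
illustrative sketch (assuming Python; the helper name ``full_shape`` and
the per-tensor ``ragged_sizes`` list are not part of the proposal):

```python
# Notation option 2: uniform_shape uses 0 to mark ragged dimensions.
uniform_shape = [400, 0, 3]   # [H, W, C] with variable W

def full_shape(uniform_shape, ragged_sizes):
    """Merge one tensor's ragged sizes into the uniform skeleton."""
    it = iter(ragged_sizes)
    return [s if s != 0 else next(it) for s in uniform_shape]

assert full_shape(uniform_shape, [217]) == [400, 217, 3]
```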

Best,
Rok

On Sat, Sep 16, 2023 at 4:34 AM Jeremy Leibs  wrote:
>
> On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc  wrote:
>
> >
> > How about also changing shape and adding uniform_shape like so:
> > """
> > **shape** is a ``FixedSizeList[ndim_ragged]`` of the ragged shape
> > of each tensor contained in ``data``, where the size of the list
> > ``ndim_ragged`` equals the number of ragged dimensions, i.e. the
> > total number of dimensions minus the number of uniform dimensions.
> > [..]
> > **uniform_shape**
> > Sizes of all contained tensors in their uniform dimensions.
> > """
> >
> > This would make the shape array smaller (in width) if more uniform
> > dimensions were provided. However, it would increase the complexity of
> > the extension type a little.
> >
> >
> This trade-off doesn't seem worthwhile to me.
>  - The shape array will almost always be dramatically smaller than the tensor
> data itself, so the space savings are unlikely to be meaningful in practice.
>  - On the other hand, coding up the index offset math for a sparsely
> represented shape with implicitly interleaved uniform dimensions is much
> more error prone (and less efficient).
>  - Even just consider answering a simple question like "What is the size of
> dimension N":
>
> If `shape` always contains all the dimensions, this is trivially `shape[N]`
> (or `shape[permutations[N]]` if permutations was specified.)
>
> On the other hand, if `shape` only contains the ragged/variable dimensions
> this lookup instead becomes something like:
> ```
> offset = count(uniform_dimensions < N)
> shape[N - offset]
> ```
>
> Maybe this doesn't seem too bad at first, but does everyone implement this
> as count()? Does someone implement it as
> `find_lower_bound(uniform_dimension, N)`? Did they validate that
> `uniform_dimensions` was specified as a sorted list?
>
> Now for added risk of errors, consider how this interacts with the
> `permutation`... in my opinion there is way too much thinking required to
> figure out if the correct value is: `shape[permutations[N] - offset]` or
> `shape[permutations[N - offset]]`.
>
> Arrow design guidance typically skews heavily in favor of efficient
> deterministic access over maximally space-efficient representations.
>
> Best,
> Jeremy


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-15 Thread Jeremy Leibs
On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc  wrote:

>
> How about also changing shape and adding uniform_shape like so:
> """
> **shape** is a ``FixedSizeList[ndim_ragged]`` of the ragged shape
> of each tensor contained in ``data``, where the size of the list
> ``ndim_ragged`` equals the number of ragged dimensions, i.e. the
> total number of dimensions minus the number of uniform dimensions.
> [..]
> **uniform_shape**
> Sizes of all contained tensors in their uniform dimensions.
> """
>
> This would make the shape array smaller (in width) if more uniform
> dimensions were provided. However, it would increase the complexity of
> the extension type a little.
>
>
This trade-off doesn't seem worthwhile to me.
 - The shape array will almost always be dramatically smaller than the tensor
data itself, so the space savings are unlikely to be meaningful in practice.
 - On the other hand, coding up the index offset math for a sparsely
represented shape with implicitly interleaved uniform dimensions is much
more error prone (and less efficient).
 - Even just consider answering a simple question like "What is the size of
dimension N":

If `shape` always contains all the dimensions, this is trivially `shape[N]`
(or `shape[permutations[N]]` if permutations was specified.)

On the other hand, if `shape` only contains the ragged/variable dimensions
this lookup instead becomes something like:
```
offset = count(uniform_dimensions < N)
shape[N - offset]
```

Maybe this doesn't seem too bad at first, but does everyone implement this
as count()? Does someone implement it as
`find_lower_bound(uniform_dimension, N)`? Did they validate that
`uniform_dimensions` was specified as a sorted list?

Now for added risk of errors, consider how this interacts with the
`permutation`... in my opinion there is way too much thinking required to
figure out if the correct value is: `shape[permutations[N] - offset]` or
`shape[permutations[N - offset]]`.

Arrow design guidance typically skews heavily in favor of efficient
deterministic access over maximally space-efficient representations.
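For concreteness, the lookup asymmetry above can be sketched in Python.
Everything here is illustrative only: the helper names and the separate
`uniform_shape` list are assumptions, not part of the proposal.

```python
def dim_size_full(shape, n, permutation=None):
    """Size of logical dimension n when `shape` stores all dimensions."""
    return shape[permutation[n]] if permutation else shape[n]

def dim_size_sparse(ragged_shape, uniform_shape, uniform_dimensions, n):
    """Size of dimension n when `shape` stores only the ragged dimensions.
    `uniform_dimensions` must be sorted for the count() trick to be valid."""
    if n in uniform_dimensions:
        return uniform_shape[uniform_dimensions.index(n)]
    offset = sum(1 for d in uniform_dimensions if d < n)  # count(uniform < N)
    return ragged_shape[n - offset]

# A [H, W, C] image tensor with variable width: uniform dims 0 and 2.
full = [400, 217, 3]
assert dim_size_full(full, 1) == 217
assert dim_size_sparse([217], [400, 3], [0, 2], 1) == 217
assert dim_size_sparse([217], [400, 3], [0, 2], 2) == 3
```

Note how the sparse variant needs the offset bookkeeping (and a sortedness
assumption) that the full-shape variant avoids entirely.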

Best,
Jeremy


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-15 Thread Rok Mihevc
First, thanks for all the input!

On Wed, Sep 13, 2023 at 6:27 AM Alenka Frim
 wrote:
> In the PR you mention that "this [ragged dimensions] would be purely
> metadata that would help converting arrow <-> jagged/ragged". Are there any
> examples available to better understand this metadata and how it would be
> used in the conversion you mention?

After checking tf [1] / pytorch [2] I now believe ``List[type]`` would
strictly be enough to express ragged/jagged tensors. However, storing
ragged/uniform dimensions would probably still be valuable for other
purposes.

[1] 
https://github.com/tensorflow/tensorflow/blob/v2.13.0/tensorflow/python/ops/ragged/ragged_tensor.py#L65-L305
[2] 
https://github.com/pytorch/torchrec/blob/cdd9f20cc6cf090c0d5fc91a01d45723905525e1/torchrec/sparse/jagged_tensor.py#L185-L225

On Wed, Sep 13, 2023 at 11:25 AM Antoine Pitrou  wrote:
> It's a bit confusing that an empty list means "no ragged dimensions" but
> a missing entry means "all dimensions are ragged". This seems
> error-prone to me.

Good point, I like Jeremy's proposal to address this.

> Also, to be clear, "ragged_dimensions" is only useful for data validation?

As the proposal stands, yes. I would like to amend it, see below.

On Wed, Sep 13, 2023 at 11:41 PM Jeremy Leibs  wrote:
> I would propose instead:
>
> **uniform_dimensions** = Indices of dimensions whose sizes are guaranteed
> to remain constant. Indices are a subset of all possible dimension indices
> ([0, 1, .., N-1]). The uniform dimensions must still be represented in the
> `shape` field, and must always have the same value for all tensors in the
> array -- this allows code to interpret the tensor correctly without
> accounting for uniform dimensions, while still permitting optional
> optimizations that take advantage of the uniformity. `uniform_dimensions`
> can be left out, in which case it is assumed that all dimensions might be
> variable.

I prefer this to the current proposal!

> Please consider adding some wording and an example such as:
> [..]

Will do.


How about also changing shape and adding uniform_shape like so:
"""
**shape** is a ``FixedSizeList[ndim_ragged]`` of the ragged shape
of each tensor contained in ``data``, where the size of the list
``ndim_ragged`` equals the number of ragged dimensions, i.e. the
total number of dimensions minus the number of uniform dimensions.
[..]
**uniform_shape**
Sizes of all contained tensors in their uniform dimensions.
"""

This would make the shape array smaller (in width) if more uniform
dimensions were provided. However, it would increase the complexity of
the extension type a little.

Best,
Rok


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-13 Thread Jeremy Leibs
Additionally, after reviewing, I also think the introduction of
permutations requires a bit more clarification.

Please consider adding some wording and an example such as:

With the exception of the permutation parameter, all of the lists and
storage within the Tensor and its extension parameters define
the *physical* storage of the tensor.

For example, consider a Tensor with:
  shape = [10, 20, 30]
  dim_names = [x, y, z]
  permutations = [2, 0, 1]

This means the logical tensor has names [z, x, y] and shape [30, 10, 20].
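A minimal sketch of the example above, assuming Python and the convention
that logical dimension i maps to physical dimension permutation[i]:

```python
shape = [10, 20, 30]          # physical shape
dim_names = ["x", "y", "z"]   # physical dimension names
permutation = [2, 0, 1]

# Logical dimension i corresponds to physical dimension permutation[i].
logical_shape = [shape[p] for p in permutation]
logical_names = [dim_names[p] for p in permutation]
assert logical_shape == [30, 10, 20]
assert logical_names == ["z", "x", "y"]
```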

Other than that, looks great! Thanks for working on this.
-Jeremy

On Wed, Sep 13, 2023 at 2:38 AM Rok Mihevc  wrote:

> After some discussion on the PR [
> https://github.com/apache/arrow/pull/37166]
> we've altered the proposed type by removing the ndim parameter and
> adding a ragged_dimensions parameter.
> If there is no further feedback I'd like to call for a vote early next
> week. Proposed language now reads:
>
> Variable shape tensor
> =====================
>
> * Extension name: `arrow.variable_shape_tensor`.
>
> * The storage type of the extension is: ``StructArray`` where struct
>   is composed of **data** and **shape** fields describing a single
>   tensor per row:
>
>   * **data** is a ``List`` holding tensor elements of a single tensor.
> Data type of the list elements is uniform across the entire column
> and also provided in metadata.
>   * **shape** is a ``FixedSizeList[ndim]`` of the tensor shape, where
> the size of the list ``ndim`` is equal to the number of dimensions
> of the tensor.
>
> * Extension type parameters:
>
>   * **value_type** = the Arrow data type of individual tensor elements.
>
>   Optional parameters describing the logical layout:
>
>   * **dim_names** = explicit names for the tensor dimensions,
> given as an array. Its length should be equal to the shape
> length, i.e. the number of dimensions.
>
> ``dim_names`` can be used if the dimensions have well-known
> names and they map to the physical layout (row-major).
>
>   * **permutation**  = indices of the desired ordering of the
> original dimensions, defined as an array.
>
> The indices contain a permutation of the values [0, 1, .., N-1] where
> N is the number of dimensions. The permutation indicates which
> dimension of the logical layout corresponds to which dimension of the
> physical tensor (the i-th dimension of the logical view corresponds
> to the dimension with number ``permutations[i]`` of the physical
> tensor).
>
> Permutation can be useful in case the logical order of
> the tensor is a permutation of the physical order (row-major).
>
> When logical and physical layout are equal, the permutation will always
> be ([0, 1, .., N-1]) and can therefore be left out.
>
>   * **ragged_dimensions** = indices of ragged dimensions whose sizes may
> differ. Dimensions where all elements have the same size are called
> uniform dimensions. Indices are a subset of all possible dimension
> indices ([0, 1, .., N-1]).
> Ragged dimensions list can be left out. In that case all dimensions
> are assumed ragged.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including number of
>   dimensions of the contained tensors as an integer with key **"ndim"**
>   plus optional dimension names with keys **"dim_names"** and ordering of
>   the dimensions with key **"permutation"**.
>
>   - Example with ``dim_names`` metadata for NCHW ordered data:
>
> ``{ "dim_names": ["C", "H", "W"] }``
>
>   - Example with ``ragged_dimensions`` metadata for a set of color images
> with variable width:
>
> ``{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }``
>
>   - Example of permuted 3-dimensional tensor:
>
> ``{ "permutation": [2, 0, 1] }``
>
> The permutation refers to the physical layout, so given an individual
> tensor of physical shape [100, 200, 500], the logical shape would be
> ``[500, 100, 200]``.
>
> .. note::
>
>   Elements in a variable shape tensor extension array are stored
>   in row-major/C-contiguous order.
>
>
> Rok
>


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-13 Thread Jeremy Leibs
On Wed, Sep 13, 2023 at 8:38 AM Antoine Pitrou  wrote:

>
> On 13/09/2023 at 02:37, Rok Mihevc wrote:
> >
> >* **ragged_dimensions** = indices of ragged dimensions whose sizes may
> >  differ. Dimensions where all elements have the same size are called
> >  uniform dimensions. Indices are a subset of all possible dimension
> >  indices ([0, 1, .., N-1]).
> >  Ragged dimensions list can be left out. In that case all dimensions
> >  are assumed ragged.
>
> It's a bit confusing that an empty list means "no ragged dimensions" but
> a missing entry means "all dimensions are ragged". This seems
> error-prone to me.
>
> Also, to be clear, "ragged_dimensions" is only useful for data validation?
>
>
I am also quite confused by how to interpret / use ragged dimensions. Given
that this is a "variable" shaped tensor, I personally find specifying the
exceptional case -- the "uniform" dimensions -- to be much more clear.

I would propose instead:

**uniform_dimensions** = Indices of dimensions whose sizes are guaranteed
to remain constant. Indices are a subset of all possible dimension indices
([0, 1, .., N-1]). The uniform dimensions must still be represented in the
`shape` field, and must always have the same value for all tensors in the
array -- this allows code to interpret the tensor correctly without
accounting for uniform dimensions, while still permitting optional
optimizations that take advantage of the uniformity. `uniform_dimensions`
can be left out, in which case it is assumed that all dimensions might be
variable.
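The invariant proposed above (uniform dimensions stay in `shape` but hold
the same value in every tensor) can be sketched as a validation check.
This is purely illustrative, assuming Python; the helper name is made up:

```python
def uniform_dims_valid(shapes, uniform_dimensions):
    """True iff every listed dimension has the same size in all tensors."""
    return all(len({s[d] for s in shapes}) == 1 for d in uniform_dimensions)

shapes = [[400, 310, 3], [400, 217, 3]]   # [H, W, C] tensors, variable W
assert uniform_dims_valid(shapes, [0, 2])
assert not uniform_dims_valid(shapes, [0, 1, 2])   # W varies across tensors
```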


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-13 Thread Antoine Pitrou

On 13/09/2023 at 02:37, Rok Mihevc wrote:

   * **ragged_dimensions** = indices of ragged dimensions whose sizes may
 differ. Dimensions where all elements have the same size are called
 uniform dimensions. Indices are a subset of all possible dimension
 indices ([0, 1, .., N-1]).
 Ragged dimensions list can be left out. In that case all dimensions
 are assumed ragged.


It's a bit confusing that an empty list means "no ragged dimensions" but 
a missing entry means "all dimensions are ragged". This seems 
error-prone to me.


Also, to be clear, "ragged_dimensions" is only useful for data validation?

Regards

Antoine.


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-12 Thread Alenka Frim
Hi all,

Thank you Rok for all your valuable work on the Arrow tensors!
I think the proposed spec and implementation are good and I have no
comments on that.

In the PR you mention that "this [ragged dimensions] would be purely
metadata that would help converting arrow <-> jagged/ragged". Are there any
examples available to better understand this metadata and how it would be
used in the conversion you mention?

Thanks!
Alenka

On Wed, Sep 13, 2023 at 2:38 AM Rok Mihevc  wrote:

> After some discussion on the PR [
> https://github.com/apache/arrow/pull/37166]
> we've altered the proposed type by removing the ndim parameter and
> adding a ragged_dimensions parameter.
> If there is no further feedback I'd like to call for a vote early next
> week. Proposed language now reads:
>
> Variable shape tensor
> =====================
>
> * Extension name: `arrow.variable_shape_tensor`.
>
> * The storage type of the extension is: ``StructArray`` where struct
>   is composed of **data** and **shape** fields describing a single
>   tensor per row:
>
>   * **data** is a ``List`` holding tensor elements of a single tensor.
> Data type of the list elements is uniform across the entire column
> and also provided in metadata.
>   * **shape** is a ``FixedSizeList[ndim]`` of the tensor shape, where
> the size of the list ``ndim`` is equal to the number of dimensions
> of the tensor.
>
> * Extension type parameters:
>
>   * **value_type** = the Arrow data type of individual tensor elements.
>
>   Optional parameters describing the logical layout:
>
>   * **dim_names** = explicit names for the tensor dimensions,
> given as an array. Its length should be equal to the shape
> length, i.e. the number of dimensions.
>
> ``dim_names`` can be used if the dimensions have well-known
> names and they map to the physical layout (row-major).
>
>   * **permutation**  = indices of the desired ordering of the
> original dimensions, defined as an array.
>
> The indices contain a permutation of the values [0, 1, .., N-1] where
> N is the number of dimensions. The permutation indicates which
> dimension of the logical layout corresponds to which dimension of the
> physical tensor (the i-th dimension of the logical view corresponds
> to the dimension with number ``permutations[i]`` of the physical
> tensor).
>
> Permutation can be useful in case the logical order of
> the tensor is a permutation of the physical order (row-major).
>
> When logical and physical layout are equal, the permutation will always
> be ([0, 1, .., N-1]) and can therefore be left out.
>
>   * **ragged_dimensions** = indices of ragged dimensions whose sizes may
> differ. Dimensions where all elements have the same size are called
> uniform dimensions. Indices are a subset of all possible dimension
> indices ([0, 1, .., N-1]).
> Ragged dimensions list can be left out. In that case all dimensions
> are assumed ragged.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including number of
>   dimensions of the contained tensors as an integer with key **"ndim"**
>   plus optional dimension names with keys **"dim_names"** and ordering of
>   the dimensions with key **"permutation"**.
>
>   - Example with ``dim_names`` metadata for NCHW ordered data:
>
> ``{ "dim_names": ["C", "H", "W"] }``
>
>   - Example with ``ragged_dimensions`` metadata for a set of color images
> with variable width:
>
> ``{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }``
>
>   - Example of permuted 3-dimensional tensor:
>
> ``{ "permutation": [2, 0, 1] }``
>
> The permutation refers to the physical layout, so given an individual
> tensor of physical shape [100, 200, 500], the logical shape would be
> ``[500, 100, 200]``.
>
> .. note::
>
>   Elements in a variable shape tensor extension array are stored
>   in row-major/C-contiguous order.
>
>
> Rok
>


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-12 Thread Rok Mihevc
After some discussion on the PR [https://github.com/apache/arrow/pull/37166]
we've altered the proposed type by removing the ndim parameter and
adding a ragged_dimensions parameter.
If there is no further feedback I'd like to call for a vote early next
week. Proposed language now reads:

Variable shape tensor
=====================

* Extension name: `arrow.variable_shape_tensor`.

* The storage type of the extension is: ``StructArray`` where struct
  is composed of **data** and **shape** fields describing a single
  tensor per row:

  * **data** is a ``List`` holding tensor elements of a single tensor.
    Data type of the list elements is uniform across the entire column
    and also provided in metadata.
  * **shape** is a ``FixedSizeList[ndim]`` of the tensor shape, where
    the size of the list ``ndim`` is equal to the number of dimensions
    of the tensor.

* Extension type parameters:

  * **value_type** = the Arrow data type of individual tensor elements.

  Optional parameters describing the logical layout:

  * **dim_names** = explicit names for the tensor dimensions,
    given as an array. Its length should be equal to the shape
    length, i.e. the number of dimensions.

    ``dim_names`` can be used if the dimensions have well-known
    names and they map to the physical layout (row-major).

  * **permutation** = indices of the desired ordering of the
    original dimensions, defined as an array.

    The indices contain a permutation of the values [0, 1, .., N-1] where
    N is the number of dimensions. The permutation indicates which
    dimension of the logical layout corresponds to which dimension of the
    physical tensor (the i-th dimension of the logical view corresponds
    to the dimension with number ``permutations[i]`` of the physical
    tensor).

    Permutation can be useful in case the logical order of
    the tensor is a permutation of the physical order (row-major).

    When logical and physical layout are equal, the permutation will
    always be ([0, 1, .., N-1]) and can therefore be left out.

  * **ragged_dimensions** = indices of ragged dimensions, whose sizes may
    differ. Dimensions where all elements have the same size are called
    uniform dimensions. Indices are a subset of all possible dimension
    indices ([0, 1, .., N-1]).
    The ragged dimensions list can be left out. In that case all
    dimensions are assumed ragged.

* Description of the serialization:

  The metadata must be a valid JSON object including the number of
  dimensions of the contained tensors as an integer with key **"ndim"**,
  plus optional dimension names with key **"dim_names"** and ordering of
  the dimensions with key **"permutation"**.

  - Example with ``dim_names`` metadata for NCHW ordered data:

    ``{ "dim_names": ["C", "H", "W"] }``

  - Example with ``ragged_dimensions`` metadata for a set of color images
    with variable width:

    ``{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }``

  - Example of a permuted 3-dimensional tensor:

    ``{ "permutation": [2, 0, 1] }``

    The permutation refers to the physical layout, so an individual
    tensor of physical shape [100, 200, 500] would have logical shape
    ``[500, 100, 200]``.

.. note::

  Elements in a variable shape tensor extension array are stored
  in row-major/C-contiguous order.


Rok


Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-09-01 Thread Dewey Dunnington
Thank you for proposing this! I left a comment on the PR as well, but
I'm excited for this to standardize a few concepts that I have run
into whilst working on ADBC and GeoArrow:

- Properly returning an array with >1 dimension from the PostgreSQL ADBC driver
- As the basis for encoding raster tiles as rows in a table (e.g.,
http://www.geopackage.org/spec/#_tile_matrix_introduction )

Excited to see the PR progress!

-dewey

On Thu, Aug 17, 2023 at 9:54 AM Rok Mihevc  wrote:
>
> Hey all!
>
>
> Besides the recently added FixedShapeTensor [1] canonical extension type,
> there appears to be a need for the already proposed VariableShapeTensor [2].
> VariableShapeTensor would store tensors of variable shape but with a uniform
> number of dimensions, dimension names, and dimension permutation.
>
> There are examples of such types: Ray implements
> ArrowVariableShapedTensorType [3] and pytorch implements torch.nested [4].
>
> I propose we discuss adding the text below to
> format/CanonicalExtensions.rst, so that it reads as [5], together with
> the C++/Python implementation proposed in [6]. A vote can be called
> after discussion here.
>
> Variable shape tensor
> =====================
>
> * Extension name: `arrow.variable_shape_tensor`.
>
> * The storage type of the extension is: ``StructArray`` where struct
>   is composed of **data** and **shape** fields describing a single
>   tensor per row:
>
>   * **data** is a ``List`` holding tensor elements of a single tensor.
>     Data type of the list elements is uniform across the entire column
>     and also provided in metadata.
>   * **shape** is a ``FixedSizeList`` of the tensor shape, where
>     the size of the list is equal to the number of dimensions of the
>     tensor.
>
> * Extension type parameters:
>
>   * **value_type** = the Arrow data type of individual tensor elements.
>   * **ndim** = the number of dimensions of the tensor.
>
>   Optional parameters describing the logical layout:
>
>   * **dim_names** = explicit names for the tensor dimensions,
>     given as an array. Its length should be equal to the shape
>     length, i.e. the number of dimensions.
>
>     ``dim_names`` can be used if the dimensions have well-known
>     names and they map to the physical layout (row-major).
>
>   * **permutation** = indices of the desired ordering of the
>     original dimensions, defined as an array.
>
>     The indices contain a permutation of the values [0, 1, .., N-1] where
>     N is the number of dimensions. The permutation indicates which
>     dimension of the logical layout corresponds to which dimension of the
>     physical tensor (the i-th dimension of the logical view corresponds
>     to the dimension with number ``permutations[i]`` of the physical
>     tensor).
>
>     Permutation can be useful in case the logical order of
>     the tensor is a permutation of the physical order (row-major).
>
>     When logical and physical layout are equal, the permutation will
>     always be ([0, 1, .., N-1]) and can therefore be left out.
>
> * Description of the serialization:
>
>   The metadata must be a valid JSON object including the number of
>   dimensions of the contained tensors as an integer with key **"ndim"**,
>   plus optional dimension names with key **"dim_names"** and ordering of
>   the dimensions with key **"permutation"**.
>
>   - Example: ``{ "ndim": 2 }``
>
>   - Example with ``dim_names`` metadata for NCHW ordered data:
>
>     ``{ "ndim": 3, "dim_names": ["C", "H", "W"] }``
>
>   - Example of a permuted 3-dimensional tensor:
>
>     ``{ "ndim": 3, "permutation": [2, 0, 1] }``
>
>     The permutation refers to the physical layout, so an individual
>     tensor of physical shape [100, 200, 500] would have logical shape
>     ``[500, 100, 200]``.
>
> .. note::
>
>   Elements in a variable shape tensor extension array are stored
>   in row-major/C-contiguous order.
>
>
> [1] https://github.com/apache/arrow/issues/33924
>
> [2] https://github.com/apache/arrow/issues/24868
>
> [3]
> https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809
>
> [4] https://pytorch.org/docs/stable/nested.html
>
> [5]
> https://github.com/apache/arrow/blob/db8d764ac3e47fa22df13b32fa77b3ad53166d58/docs/source/format/CanonicalExtensions.rst#variable-shape-tensor
>
> [6] https://github.com/apache/arrow/pull/37166
>
>
>
> Best,
>
> Rok


[DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type

2023-08-17 Thread Rok Mihevc
Hey all!


Besides the recently added FixedShapeTensor [1] canonical extension type,
there appears to be a need for the already proposed VariableShapeTensor [2].
VariableShapeTensor would store tensors of variable shape but with a uniform
number of dimensions, dimension names, and dimension permutation.

There are examples of such types: Ray implements
ArrowVariableShapedTensorType [3] and pytorch implements torch.nested [4].

I propose we discuss adding the text below to
format/CanonicalExtensions.rst, so that it reads as [5], together with the
C++/Python implementation proposed in [6]. A vote can be called after
discussion here.

Variable shape tensor
=====================

* Extension name: `arrow.variable_shape_tensor`.

* The storage type of the extension is: ``StructArray`` where struct
  is composed of **data** and **shape** fields describing a single
  tensor per row:

  * **data** is a ``List`` holding tensor elements of a single tensor.
    Data type of the list elements is uniform across the entire column
    and also provided in metadata.
  * **shape** is a ``FixedSizeList`` of the tensor shape, where
    the size of the list is equal to the number of dimensions of the
    tensor.

* Extension type parameters:

  * **value_type** = the Arrow data type of individual tensor elements.
  * **ndim** = the number of dimensions of the tensor.

  Optional parameters describing the logical layout:

  * **dim_names** = explicit names for the tensor dimensions,
    given as an array. Its length should be equal to the shape
    length, i.e. the number of dimensions.

    ``dim_names`` can be used if the dimensions have well-known
    names and they map to the physical layout (row-major).

  * **permutation** = indices of the desired ordering of the
    original dimensions, defined as an array.

    The indices contain a permutation of the values [0, 1, .., N-1] where
    N is the number of dimensions. The permutation indicates which
    dimension of the logical layout corresponds to which dimension of the
    physical tensor (the i-th dimension of the logical view corresponds
    to the dimension with number ``permutations[i]`` of the physical
    tensor).

    Permutation can be useful in case the logical order of
    the tensor is a permutation of the physical order (row-major).

    When logical and physical layout are equal, the permutation will
    always be ([0, 1, .., N-1]) and can therefore be left out.

* Description of the serialization:

  The metadata must be a valid JSON object including the number of
  dimensions of the contained tensors as an integer with key **"ndim"**,
  plus optional dimension names with key **"dim_names"** and ordering of
  the dimensions with key **"permutation"**.

  - Example: ``{ "ndim": 2 }``

  - Example with ``dim_names`` metadata for NCHW ordered data:

    ``{ "ndim": 3, "dim_names": ["C", "H", "W"] }``

  - Example of a permuted 3-dimensional tensor:

    ``{ "ndim": 3, "permutation": [2, 0, 1] }``

    The permutation refers to the physical layout, so an individual
    tensor of physical shape [100, 200, 500] would have logical shape
    ``[500, 100, 200]``.

.. note::

  Elements in a variable shape tensor extension array are stored
  in row-major/C-contiguous order.
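As a quick illustrative check (assuming Python; not part of the proposal
text), the permuted-tensor serialization example above works out as:

```python
import json

# Parse the example metadata and derive the logical shape.
meta = json.loads('{ "ndim": 3, "permutation": [2, 0, 1] }')
physical = [100, 200, 500]   # physical shape of one tensor
logical = [physical[p] for p in meta["permutation"]]
assert meta["ndim"] == 3
assert logical == [500, 100, 200]
```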


[1] https://github.com/apache/arrow/issues/33924

[2] https://github.com/apache/arrow/issues/24868

[3]
https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809

[4] https://pytorch.org/docs/stable/nested.html

[5]
https://github.com/apache/arrow/blob/db8d764ac3e47fa22df13b32fa77b3ad53166d58/docs/source/format/CanonicalExtensions.rst#variable-shape-tensor

[6] https://github.com/apache/arrow/pull/37166



Best,

Rok