Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
Thanks for all the input and feedback so far! I went ahead and added the uniform_shape parameter with the second notation option to the proposal [1]. More discussion could be valuable here, so I'd like to call the vote on Wednesday. [1] https://github.com/apache/arrow/pull/37166/files/9c827a0ba54280f4695202e17e32902986c4f12f#diff-b54425cb176b53e51925c13a4d4e85cf7d03d4e1226e6d5bf4d7ae09923db8b3 Best, Rok On Sat, Sep 16, 2023 at 3:11 PM Rok Mihevc wrote: > I agree, the increased complexity is probably not worth the savings > from keeping only shapes of ragged dimensions. > However I still think it could be valuable to optionally store uniform > dimension shape as metadata. So: > > **uniform_shape** = Sizes of all contained tensors in their uniform > dimensions. > > Given we have a series of 3-dimensional image tensors [H, W, C] with > variable width (uniform_dimensions = [0, 2]): > > Notation option 1: > **uniform_shape** = [400,3] > > Notation option 2: > **uniform_shape** = [400,0,3] > > Best, > Rok > > On Sat, Sep 16, 2023 at 4:34 AM Jeremy Leibs wrote: > > > > On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc wrote: > > > > > > > > How about also changing shape and adding uniform_shape like so: > > > """ > > > **shape** is a ``FixedSizeList[ndim_ragged]`` of ragged shape > > > of each tensor contained in ``data`` where the size of the list > > > ``ndim_ragged`` is equal to the number of dimensions of tensor > > > subtracted by the number of ragged dimensions. > > > [..] > > > **uniform_shape** > > > Sizes of all contained tensors in their uniform dimensions. > > > """ > > > > > > This would make shape array smaller (in width) if more uniform > > > dimensions were provided. However it would increase the complexity of > > > the extension type a little bit. > > > > > > > > This trade-off doesn't seem worthwhile to me. > > - Shape array will almost always be dramatically smaller than the tensor > > data itself, so the space savings are unlikely to be meaningful in > practice. 
> > - On the other hand, coding up the index offset math for a sparsely > > represented shape with implicitly interleaved uniform dimensions is much > > more error prone (and less efficient). > > - Even just consider answering a simple question like "What is the size > of > > dimension N": > > > > If `shape` always contains all the dimensions, this is trivially > `shape[N]` > > (or `shape[permutations[N]]` if permutations was specified.) > > > > On the other hand, if `shape` only contains the ragged/variable > dimensions > > this lookup instead becomes something like: > > ``` > > offset = count(uniform_dimensions < N) > > shape[N - offset] > > ``` > > > > Maybe this doesn't seem too bad at first, but does everyone implement > this > > as count()? Does someone implement it as > > `find_lower_bound(uniform_dimension, N)`? Did they validate that > > `uniform_dimensions` was specified as a sorted list? > > > > Now for added risk of errors, consider how this interacts with the > > `permutation`... in my opinion there is way too much thinking required to > > figure out if the correct value is: `shape[permutations[N] - offset]` or > > `shape[permutations[N - offset]]`. > > > > Arrow design guidance typically skews heavily in favor of efficient > > deterministic access over maximally space-efficient representations. > > > > Best, > > Jeremy >
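Rok's notation option 2 can be illustrated with a small sketch (the helper is hypothetical, not part of the proposal): under that notation the ragged slots of uniform_shape hold 0, and a tensor's full shape is recovered by filling the zeros from its per-tensor ragged sizes. Note this assumes a uniform dimension never legitimately has size 0.

```python
# Sketch of notation option 2: uniform_shape = [400, 0, 3] for [H, W, C]
# images with uniform H and C (uniform_dimensions = [0, 2]) and variable W.
# Assumption: 0 is used only as the ragged-slot placeholder.

def full_shape(uniform_shape, ragged_sizes):
    """Fill the zero (ragged) slots of uniform_shape with per-tensor sizes."""
    it = iter(ragged_sizes)
    return [next(it) if s == 0 else s for s in uniform_shape]

# Two images, 400 pixels high, 3 channels, widths 640 and 512:
print(full_shape([400, 0, 3], [640]))  # [400, 640, 3]
print(full_shape([400, 0, 3], [512]))  # [400, 512, 3]
```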
Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
I agree, the increased complexity is probably not worth the savings from keeping only shapes of ragged dimensions. However I still think it could be valuable to optionally store uniform dimension shape as metadata. So: **uniform_shape** = Sizes of all contained tensors in their uniform dimensions. Given we have a series of 3-dimensional image tensors [H, W, C] with variable width (uniform_dimensions = [0, 2]): Notation option 1: **uniform_shape** = [400,3] Notation option 2: **uniform_shape** = [400,0,3] Best, Rok On Sat, Sep 16, 2023 at 4:34 AM Jeremy Leibs wrote: > > On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc wrote: > > > > > How about also changing shape and adding uniform_shape like so: > > """ > > **shape** is a ``FixedSizeList[ndim_ragged]`` of ragged shape > > of each tensor contained in ``data`` where the size of the list > > ``ndim_ragged`` is equal to the number of dimensions of tensor > > subtracted by the number of ragged dimensions. > > [..] > > **uniform_shape** > > Sizes of all contained tensors in their uniform dimensions. > > """ > > > > This would make shape array smaller (in width) if more uniform > > dimensions were provided. However it would increase the complexity of > > the extension type a little bit. > > > > > This trade-off doesn't seem worthwhile to me. > - Shape array will almost always be dramatically smaller than the tensor > data itself, so the space savings are unlikely to be meaningful in practice. > - On the other hand, coding up the index offset math for a sparsely > represented shape with implicitly interleaved uniform dimensions is much > more error prone (and less efficient). > - Even just consider answering a simple question like "What is the size of > dimension N": > > If `shape` always contains all the dimensions, this is trivially `shape[N]` > (or `shape[permutations[N]]` if permutations was specified.) 
> > On the other hand, if `shape` only contains the ragged/variable dimensions > this lookup instead becomes something like: > ``` > offset = count(uniform_dimensions < N) > shape[N - offset] > ``` > > Maybe this doesn't seem too bad at first, but does everyone implement this > as count()? Does someone implement it as > `find_lower_bound(uniform_dimension, N)`? Did they validate that > `uniform_dimensions` was specified as a sorted list? > > Now for added risk of errors, consider how this interacts with the > `permutation`... in my opinion there is way too much thinking required to > figure out if the correct value is: `shape[permutations[N] - offset]` or > `shape[permutations[N - offset]]`. > > Arrow design guidance typically skews heavily in favor of efficient > deterministic access over maximally space-efficient representations. > > Best, > Jeremy
Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc wrote: > > How about also changing shape and adding uniform_shape like so: > """ > **shape** is a ``FixedSizeList[ndim_ragged]`` of ragged shape > of each tensor contained in ``data`` where the size of the list > ``ndim_ragged`` is equal to the number of dimensions of tensor > subtracted by the number of ragged dimensions. > [..] > **uniform_shape** > Sizes of all contained tensors in their uniform dimensions. > """ > > This would make shape array smaller (in width) if more uniform > dimensions were provided. However it would increase the complexity of > the extension type a little bit. > > This trade-off doesn't seem worthwhile to me. - Shape array will almost always be dramatically smaller than the tensor data itself, so the space savings are unlikely to be meaningful in practice. - On the other hand, coding up the index offset math for a sparsely represented shape with implicitly interleaved uniform dimensions is much more error prone (and less efficient). - Even just consider answering a simple question like "What is the size of dimension N": If `shape` always contains all the dimensions, this is trivially `shape[N]` (or `shape[permutations[N]]` if permutations was specified.) On the other hand, if `shape` only contains the ragged/variable dimensions this lookup instead becomes something like: ``` offset = count(uniform_dimensions < N) shape[N - offset] ``` Maybe this doesn't seem too bad at first, but does everyone implement this as count()? Does someone implement it as `find_lower_bound(uniform_dimension, N)`? Did they validate that `uniform_dimensions` was specified as a sorted list? Now for added risk of errors, consider how this interacts with the `permutation`... in my opinion there is way too much thinking required to figure out if the correct value is: `shape[permutations[N] - offset]` or `shape[permutations[N - offset]]`. 
Arrow design guidance typically skews heavily in favor of efficient deterministic access over maximally space-efficient representations. Best, Jeremy
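Jeremy's two lookup paths can be sketched side by side (illustrative Python; the helper names are mine, not from the thread):

```python
from bisect import bisect_left

def dim_size_full(shape, n, permutation=None):
    # With `shape` storing every dimension, lookup is direct.
    return shape[permutation[n]] if permutation else shape[n]

def dim_size_sparse(ragged_shape, uniform_dimensions, uniform_shape, n):
    # The sparse variant: offset bookkeeping is required, and
    # `uniform_dimensions` must be sorted for bisect_left to be valid.
    if n in uniform_dimensions:
        return uniform_shape[uniform_dimensions.index(n)]
    offset = bisect_left(uniform_dimensions, n)  # count(uniform_dimensions < n)
    return ragged_shape[n - offset]

# [H, W, C] = [400, 640, 3] with only W ragged (uniform_dimensions = [0, 2]):
print(dim_size_full([400, 640, 3], 1))              # 640
print(dim_size_sparse([640], [0, 2], [400, 3], 1))  # 640
```

Both return the same answer, but the sparse path already involves a sorted-input precondition and an index offset that are easy to get wrong.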
Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
First, thanks for all the input! On Wed, Sep 13, 2023 at 6:27 AM Alenka Frim wrote: > In the PR you mention that "this [ragged dimensions] would be purely > metadata that would help converting arrow <-> jagged/ragged". Are there any > examples available to better understand this metadata and how it would be > used in the conversion you mention? After checking tf [1] / pytorch [2] I now believe List[type] would strictly be enough to express ragged/jagged. However, storing ragged/uniform dimensions would probably still be valuable for other purposes. [1] https://github.com/tensorflow/tensorflow/blob/v2.13.0/tensorflow/python/ops/ragged/ragged_tensor.py#L65-L305 [2] https://github.com/pytorch/torchrec/blob/cdd9f20cc6cf090c0d5fc91a01d45723905525e1/torchrec/sparse/jagged_tensor.py#L185-L225 On Wed, Sep 13, 2023 at 11:25 AM Antoine Pitrou wrote: > It's a bit confusing that an empty list means "no ragged dimensions" but > a missing entry means "all dimensions are ragged". This seems > error-prone to me. Good point, I like Jeremy's proposal to address this. > Also, to be clear, "ragged_dimensions" is only useful for data validation? As the proposal stands, yes. I would like to amend it, see below. On Wed, Sep 13, 2023 at 11:41 PM Jeremy Leibs wrote: > I would propose instead: > > **uniform_dimensions** = Indices of dimensions whose sizes are guaranteed to > remain constant. > Indices are a subset of all possible dimension indices ([0, 1, .., N-1]). > The uniform dimensions must > still be represented in the `shape` field, and must always be the same > value for all tensors in the > array -- this allows code to interpret the tensor correctly without > accounting for uniform dimensions > while still permitting optional optimizations that take advantage of the > uniformity. Uniform_dimensions > can be left out, in which case it is assumed that all dimensions might be > variable. I prefer this to the current proposal! 
> Please consider adding some wording and an example such as: > [..] Will do. How about also changing shape and adding uniform_shape like so: """ **shape** is a ``FixedSizeList[ndim_ragged]`` of the ragged shape of each tensor contained in ``data``, where the size of the list ``ndim_ragged`` is equal to the number of dimensions of the tensor minus the number of uniform dimensions. [..] **uniform_shape** Sizes of all contained tensors in their uniform dimensions. """ This would make the shape array smaller (in width) if more uniform dimensions were provided. However it would increase the complexity of the extension type a little bit. Best, Rok
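The point above that ``List[type]`` is enough to express a ragged axis can be seen from Arrow's variable-size list layout, sketched here in plain Python (a values buffer plus an offsets buffer; no pyarrow dependency assumed):

```python
# A List<int32> column models one jagged axis: each row i is the slice
# values[offsets[i]:offsets[i+1]], so row lengths can differ freely.
values = [1, 2, 3, 4, 5, 6]
offsets = [0, 2, 6]  # row 0 -> 2 elements, row 1 -> 4 elements

rows = [values[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]
print(rows)  # [[1, 2], [3, 4, 5, 6]]
```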
Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
Additionally, after reviewing, I also think the introduction of permutations requires a bit more clarification. Please consider adding some wording and an example such as: With the exception of the permutation parameter, all other lists and storage within the Tensor and the extension parameters define the *physical* storage of the tensor. For example, consider a Tensor with: shape = [10, 20, 30] dim_names = [x, y, z] permutations = [2, 0, 1] This means the logical tensor has names [z, x, y] and shape [30, 10, 20]. Other than that, looks great! Thanks for working on this. -Jeremy On Wed, Sep 13, 2023 at 2:38 AM Rok Mihevc wrote: > After some discussion on the PR [ > https://github.com/apache/arrow/pull/37166] > we've altered the proposed type by removing the ndim parameter and > adding ragged_dimensions one. > If there is no further feedback I'd like to call for a vote early next > week. Proposed language now reads: > > Variable shape tensor > = > > * Extension name: `arrow.variable_shape_tensor`. > > * The storage type of the extension is: ``StructArray`` where struct > is composed of **data** and **shape** fields describing a single > tensor per row: > > * **data** is a ``List`` holding tensor elements of a single tensor. > Data type of the list elements is uniform across the entire column > and also provided in metadata. > * **shape** is a ``FixedSizeList[ndim]`` of the tensor shape > where > the size of the list ``ndim`` is equal to the number of dimensions of > the > tensor. > > * Extension type parameters: > > * **value_type** = the Arrow data type of individual tensor elements. > > Optional parameters describing the logical layout: > > * **dim_names** = explicit names to tensor dimensions > as an array. The length of it should be equal to the shape > length and equal to the number of dimensions. > > ``dim_names`` can be used if the dimensions have well-known > names and they map to the physical layout (row-major). 
> > * **permutation** = indices of the desired ordering of the > original dimensions, defined as an array. > > The indices contain a permutation of the values [0, 1, .., N-1] where > N is the number of dimensions. The permutation indicates which > dimension of the logical layout corresponds to which dimension of the > physical tensor (the i-th dimension of the logical view corresponds > to the dimension with number ``permutations[i]`` of the physical > tensor). > > Permutation can be useful in case the logical order of > the tensor is a permutation of the physical order (row-major). > > When logical and physical layout are equal, the permutation will always > be ([0, 1, .., N-1]) and can therefore be left out. > > * **ragged_dimensions** = indices of ragged dimensions whose sizes may > differ. Dimensions where all elements have the same size are called > uniform dimensions. Indices are a subset of all possible dimension > indices ([0, 1, .., N-1]). > Ragged dimensions list can be left out. In that case all dimensions > are assumed ragged. > > * Description of the serialization: > > The metadata must be a valid JSON object including number of > dimensions of the contained tensors as an integer with key **"ndim"** > plus optional dimension names with keys **"dim_names"** and ordering of > the dimensions with key **"permutation"**. > > - Example with ``dim_names`` metadata for NCHW ordered data: > > ``{ "dim_names": ["C", "H", "W"] }`` > > - Example with ``ragged_dimensions`` metadata for a set of color images > with variable width: > > ``{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }`` > > - Example of permuted 3-dimensional tensor: > > ``{ "permutation": [2, 0, 1] }`` > > This is the physical layout shape and the shape of the logical > layout would given an individual tensor of shape [100, 200, 500] > be ``[500, 100, 200]``. > > .. note:: > > Elements in a variable shape tensor extension array are stored > in row-major/C-contiguous order. > > > Rok >
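Jeremy's permutation example earlier in this message can be checked mechanically (illustrative Python; `logical_view` is a hypothetical helper, not part of the proposal):

```python
def logical_view(physical_shape, dim_names, permutation):
    # Per the spec text: the i-th logical dimension corresponds to
    # physical dimension permutation[i].
    return ([physical_shape[p] for p in permutation],
            [dim_names[p] for p in permutation])

shape, names = logical_view([10, 20, 30], ["x", "y", "z"], [2, 0, 1])
print(shape, names)  # [30, 10, 20] ['z', 'x', 'y']
```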
Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
On Wed, Sep 13, 2023 at 8:38 AM Antoine Pitrou wrote: > > On 13/09/2023 at 02:37, Rok Mihevc wrote: > > > >* **ragged_dimensions** = indices of ragged dimensions whose sizes may > > differ. Dimensions where all elements have the same size are called > > uniform dimensions. Indices are a subset of all possible dimension > > indices ([0, 1, .., N-1]). > > Ragged dimensions list can be left out. In that case all dimensions > > are assumed ragged. > > It's a bit confusing that an empty list means "no ragged dimensions" but > a missing entry means "all dimensions are ragged". This seems > error-prone to me. > > Also, to be clear, "ragged_dimensions" is only useful for data validation? > > I am also quite confused by how to interpret / use ragged dimensions. Given that this is a "variable" shaped tensor, I personally find specifying the exceptional case -- the "uniform" dimensions -- to be much more clear. I would propose instead: **uniform_dimensions** = Indices of dimensions whose sizes are guaranteed to remain constant. Indices are a subset of all possible dimension indices ([0, 1, .., N-1]). The uniform dimensions must still be represented in the `shape` field, and must always be the same value for all tensors in the array -- this allows code to interpret the tensor correctly without accounting for uniform dimensions while still permitting optional optimizations that take advantage of the uniformity. Uniform_dimensions can be left out, in which case it is assumed that all dimensions might be variable.
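A minimal sketch of the validation this definition implies (hypothetical helper, not part of the proposal): each listed dimension must have the same size in every tensor of the array.

```python
def validate_uniform_dimensions(shapes, uniform_dimensions):
    """Check that each listed dimension has one size across all tensors."""
    for d in uniform_dimensions:
        sizes = {shape[d] for shape in shapes}
        if len(sizes) > 1:
            raise ValueError(f"dimension {d} is not uniform: sizes {sorted(sizes)}")

# [H, W, C] images with uniform H and C but variable W:
validate_uniform_dimensions([[400, 640, 3], [400, 512, 3]], [0, 2])  # passes
```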
Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
On 13/09/2023 at 02:37, Rok Mihevc wrote: * **ragged_dimensions** = indices of ragged dimensions whose sizes may differ. Dimensions where all elements have the same size are called uniform dimensions. Indices are a subset of all possible dimension indices ([0, 1, .., N-1]). Ragged dimensions list can be left out. In that case all dimensions are assumed ragged. It's a bit confusing that an empty list means "no ragged dimensions" but a missing entry means "all dimensions are ragged". This seems error-prone to me. Also, to be clear, "ragged_dimensions" is only useful for data validation? Regards Antoine.
Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
Hi all, Thank you Rok for all your valuable work on the Arrow tensors! I think the proposed spec and implementation are good and I have no comments on that. In the PR you mention that "this [ragged dimensions] would be purely metadata that would help converting arrow <-> jagged/ragged". Are there any examples available to better understand this metadata and how it would be used in the conversion you mention? Thanks! Alenka On Wed, Sep 13, 2023 at 2:38 AM Rok Mihevc wrote: > After some discussion on the PR [ > https://github.com/apache/arrow/pull/37166] > we've altered the proposed type by removing the ndim parameter and > adding ragged_dimensions one. > If there is no further feedback I'd like to call for a vote early next > week. Proposed language now reads: > > Variable shape tensor > = > > * Extension name: `arrow.variable_shape_tensor`. > > * The storage type of the extension is: ``StructArray`` where struct > is composed of **data** and **shape** fields describing a single > tensor per row: > > * **data** is a ``List`` holding tensor elements of a single tensor. > Data type of the list elements is uniform across the entire column > and also provided in metadata. > * **shape** is a ``FixedSizeList[ndim]`` of the tensor shape > where > the size of the list ``ndim`` is equal to the number of dimensions of > the > tensor. > > * Extension type parameters: > > * **value_type** = the Arrow data type of individual tensor elements. > > Optional parameters describing the logical layout: > > * **dim_names** = explicit names to tensor dimensions > as an array. The length of it should be equal to the shape > length and equal to the number of dimensions. > > ``dim_names`` can be used if the dimensions have well-known > names and they map to the physical layout (row-major). > > * **permutation** = indices of the desired ordering of the > original dimensions, defined as an array. 
> > The indices contain a permutation of the values [0, 1, .., N-1] where > N is the number of dimensions. The permutation indicates which > dimension of the logical layout corresponds to which dimension of the > physical tensor (the i-th dimension of the logical view corresponds > to the dimension with number ``permutations[i]`` of the physical > tensor). > > Permutation can be useful in case the logical order of > the tensor is a permutation of the physical order (row-major). > > When logical and physical layout are equal, the permutation will always > be ([0, 1, .., N-1]) and can therefore be left out. > > * **ragged_dimensions** = indices of ragged dimensions whose sizes may > differ. Dimensions where all elements have the same size are called > uniform dimensions. Indices are a subset of all possible dimension > indices ([0, 1, .., N-1]). > Ragged dimensions list can be left out. In that case all dimensions > are assumed ragged. > > * Description of the serialization: > > The metadata must be a valid JSON object including number of > dimensions of the contained tensors as an integer with key **"ndim"** > plus optional dimension names with keys **"dim_names"** and ordering of > the dimensions with key **"permutation"**. > > - Example with ``dim_names`` metadata for NCHW ordered data: > > ``{ "dim_names": ["C", "H", "W"] }`` > > - Example with ``ragged_dimensions`` metadata for a set of color images > with variable width: > > ``{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }`` > > - Example of permuted 3-dimensional tensor: > > ``{ "permutation": [2, 0, 1] }`` > > This is the physical layout shape and the shape of the logical > layout would given an individual tensor of shape [100, 200, 500] > be ``[500, 100, 200]``. > > .. note:: > > Elements in a variable shape tensor extension array are stored > in row-major/C-contiguous order. > > > Rok >
Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
After some discussion on the PR [https://github.com/apache/arrow/pull/37166] we've altered the proposed type by removing the ndim parameter and adding ragged_dimensions one. If there is no further feedback I'd like to call for a vote early next week. Proposed language now reads: Variable shape tensor = * Extension name: `arrow.variable_shape_tensor`. * The storage type of the extension is: ``StructArray`` where struct is composed of **data** and **shape** fields describing a single tensor per row: * **data** is a ``List`` holding tensor elements of a single tensor. Data type of the list elements is uniform across the entire column and also provided in metadata. * **shape** is a ``FixedSizeList[ndim]`` of the tensor shape where the size of the list ``ndim`` is equal to the number of dimensions of the tensor. * Extension type parameters: * **value_type** = the Arrow data type of individual tensor elements. Optional parameters describing the logical layout: * **dim_names** = explicit names to tensor dimensions as an array. The length of it should be equal to the shape length and equal to the number of dimensions. ``dim_names`` can be used if the dimensions have well-known names and they map to the physical layout (row-major). * **permutation** = indices of the desired ordering of the original dimensions, defined as an array. The indices contain a permutation of the values [0, 1, .., N-1] where N is the number of dimensions. The permutation indicates which dimension of the logical layout corresponds to which dimension of the physical tensor (the i-th dimension of the logical view corresponds to the dimension with number ``permutations[i]`` of the physical tensor). Permutation can be useful in case the logical order of the tensor is a permutation of the physical order (row-major). When logical and physical layout are equal, the permutation will always be ([0, 1, .., N-1]) and can therefore be left out. 
* **ragged_dimensions** = indices of ragged dimensions whose sizes may differ. Dimensions where all elements have the same size are called uniform dimensions. Indices are a subset of all possible dimension indices ([0, 1, .., N-1]). The ragged dimensions list can be left out; in that case all dimensions are assumed ragged. * Description of the serialization: The metadata must be a valid JSON object including number of dimensions of the contained tensors as an integer with key **"ndim"** plus optional dimension names with key **"dim_names"** and ordering of the dimensions with key **"permutation"**. - Example with ``dim_names`` metadata for NCHW ordered data: ``{ "dim_names": ["C", "H", "W"] }`` - Example with ``ragged_dimensions`` metadata for a set of color images with variable width: ``{ "dim_names": ["H", "W", "C"], "ragged_dimensions": [1] }`` - Example of permuted 3-dimensional tensor: ``{ "permutation": [2, 0, 1] }`` This is the shape of the physical layout; given an individual tensor of shape [100, 200, 500], the shape of the logical layout would be ``[500, 100, 200]``. .. note:: Elements in a variable shape tensor extension array are stored in row-major/C-contiguous order. Rok
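As a toy illustration of the proposed storage (plain Python standing in for the ``StructArray`` of **data** and **shape**; helper names are mine), each row holds a flat row-major buffer plus its own shape:

```python
# One struct row per tensor: {"data": flat elements, "shape": FixedSizeList[ndim]}.
rows = [
    {"data": [1, 2, 3, 4, 5, 6], "shape": [2, 3]},  # a 2x3 tensor
    {"data": [7, 8],             "shape": [1, 2]},  # a 1x2 tensor
]

def element(row, i, j):
    # Row-major / C-contiguous indexing, per the spec's note.
    _, ncols = row["shape"]
    return row["data"][i * ncols + j]

print(element(rows[0], 1, 2))  # 6
```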
Re: [DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
Thank you for proposing this! I left a comment on the PR as well, but I'm excited for this to standardize a few concepts that I have run into whilst working on ADBC and GeoArrow: - Properly returning an array with >1 dimension from the PostgreSQL ADBC driver - As the basis for encoding raster tiles as rows in a table (e.g., http://www.geopackage.org/spec/#_tile_matrix_introduction ) Excited to see the PR progress! -dewey On Thu, Aug 17, 2023 at 9:54 AM Rok Mihevc wrote: > > Hey all! > > > Besides the recently added FixedShapeTensor [1] canonical extension type > there appears to be a need for an already proposed VariableShapeTensor > [2]. VariableShapeTensor > would store tensors of variable shapes but uniform number of > dimensions, dimension names and dimension permutations. > > There are examples of such types: Ray implements > ArrowVariableShapedTensorType [3] and pytorch implements torch.nested [4]. > > I propose we discuss adding the below text to > format/CanonicalExtensions.rst to read as [5] and a C++/Python > implementation as proposed in [6]. A vote can be called after a discussion > here. > > Variable shape tensor > > = > > * Extension name: `arrow.variable_shape_tensor`. > > * The storage type of the extension is: ``StructArray`` where struct > > is composed of **data** and **shape** fields describing a single > > tensor per row: > > * **data** is a ``List`` holding tensor elements of a single tensor. > > Data type of the list elements is uniform across the entire column > > and also provided in metadata. > > * **shape** is a ``FixedSizeList`` of the tensor shape where > > the size of the list is equal to the number of dimensions of the > > tensor. > > * Extension type parameters: > > * **value_type** = the Arrow data type of individual tensor elements. > > * **ndim** = the number of dimensions of the tensor. > > Optional parameters describing the logical layout: > > * **dim_names** = explicit names to tensor dimensions > > as an array. 
The length of it should be equal to the shape > > length and equal to the number of dimensions. > > ``dim_names`` can be used if the dimensions have well-known > > names and they map to the physical layout (row-major). > > * **permutation** = indices of the desired ordering of the > > original dimensions, defined as an array. > > The indices contain a permutation of the values [0, 1, .., N-1] where > > N is the number of dimensions. The permutation indicates which > > dimension of the logical layout corresponds to which dimension of the > > physical tensor (the i-th dimension of the logical view corresponds > > to the dimension with number ``permutations[i]`` of the physical > tensor). > > Permutation can be useful in case the logical order of > > the tensor is a permutation of the physical order (row-major). > > When logical and physical layout are equal, the permutation will always > > be ([0, 1, .., N-1]) and can therefore be left out. > > * Description of the serialization: > > The metadata must be a valid JSON object including number of > > dimensions of the contained tensors as an integer with key **"ndim"** > > plus optional dimension names with keys **"dim_names"** and ordering of > > the dimensions with key **"permutation"**. > > - Example: ``{ "ndim": 2}`` > > - Example with ``dim_names`` metadata for NCHW ordered data: > > ``{ "ndim": 3, "dim_names": ["C", "H", "W"]}`` > > - Example of permuted 3-dimensional tensor: > > ``{ "ndim": 3, "permutation": [2, 0, 1]}`` > > This is the physical layout shape and the shape of the logical > > layout would given an individual tensor of shape [100, 200, 500] > > be ``[500, 100, 200]``. > > .. note:: > > Elements in a variable shape tensor extension array are stored > > in row-major/C-contiguous order. 
> > > [1] https://github.com/apache/arrow/issues/33924 > > [2] https://github.com/apache/arrow/issues/24868 > > [3] > https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809 > > [4] https://pytorch.org/docs/stable/nested.html > > [5] > https://github.com/apache/arrow/blob/db8d764ac3e47fa22df13b32fa77b3ad53166d58/docs/source/format/CanonicalExtensions.rst#variable-shape-tensor > > [6] https://github.com/apache/arrow/pull/37166 > > > > Best, > > Rok
[DISCUSS] Proposal to add VariableShapeTensor Canonical Extension Type
Hey all! Besides the recently added FixedShapeTensor [1] canonical extension type there appears to be a need for the already proposed VariableShapeTensor [2]. VariableShapeTensor would store tensors of variable shape but with a uniform number of dimensions, dimension names, and dimension permutation. There are examples of such types: Ray implements ArrowVariableShapedTensorType [3] and pytorch implements torch.nested [4]. I propose we discuss adding the below text to format/CanonicalExtensions.rst to read as [5] and a C++/Python implementation as proposed in [6]. A vote can be called after a discussion here. Variable shape tensor = * Extension name: `arrow.variable_shape_tensor`. * The storage type of the extension is: ``StructArray`` where struct is composed of **data** and **shape** fields describing a single tensor per row: * **data** is a ``List`` holding tensor elements of a single tensor. Data type of the list elements is uniform across the entire column and also provided in metadata. * **shape** is a ``FixedSizeList`` of the tensor shape where the size of the list is equal to the number of dimensions of the tensor. * Extension type parameters: * **value_type** = the Arrow data type of individual tensor elements. * **ndim** = the number of dimensions of the tensor. Optional parameters describing the logical layout: * **dim_names** = explicit names to tensor dimensions as an array. The length of it should be equal to the shape length and equal to the number of dimensions. ``dim_names`` can be used if the dimensions have well-known names and they map to the physical layout (row-major). * **permutation** = indices of the desired ordering of the original dimensions, defined as an array. The indices contain a permutation of the values [0, 1, .., N-1] where N is the number of dimensions. 
The permutation indicates which dimension of the logical layout corresponds to which dimension of the physical tensor (the i-th dimension of the logical view corresponds to the dimension with number ``permutations[i]`` of the physical tensor). Permutation can be useful in case the logical order of the tensor is a permutation of the physical order (row-major). When logical and physical layout are equal, the permutation will always be ([0, 1, .., N-1]) and can therefore be left out. * Description of the serialization: The metadata must be a valid JSON object including number of dimensions of the contained tensors as an integer with key **"ndim"** plus optional dimension names with key **"dim_names"** and ordering of the dimensions with key **"permutation"**. - Example: ``{ "ndim": 2}`` - Example with ``dim_names`` metadata for NCHW ordered data: ``{ "ndim": 3, "dim_names": ["C", "H", "W"]}`` - Example of permuted 3-dimensional tensor: ``{ "ndim": 3, "permutation": [2, 0, 1]}`` This is the shape of the physical layout; given an individual tensor of shape [100, 200, 500], the shape of the logical layout would be ``[500, 100, 200]``. .. note:: Elements in a variable shape tensor extension array are stored in row-major/C-contiguous order. [1] https://github.com/apache/arrow/issues/33924 [2] https://github.com/apache/arrow/issues/24868 [3] https://github.com/ray-project/ray/blob/ada5db71db36f672301639a61b5849fd4fd5914e/python/ray/air/util/tensor_extensions/arrow.py#L528-L809 [4] https://pytorch.org/docs/stable/nested.html [5] https://github.com/apache/arrow/blob/db8d764ac3e47fa22df13b32fa77b3ad53166d58/docs/source/format/CanonicalExtensions.rst#variable-shape-tensor [6] https://github.com/apache/arrow/pull/37166 Best, Rok
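Since the serialized metadata is plain JSON, a reader can parse and sanity-check it with stdlib tools. The checks below are assumptions about reasonable validation, not requirements stated in the proposal:

```python
import json

# Parse the spec's permuted-tensor example metadata.
meta = json.loads('{ "ndim": 3, "permutation": [2, 0, 1] }')

# Assumed checks: permutation length matches ndim and is a valid permutation.
assert len(meta["permutation"]) == meta["ndim"]
assert sorted(meta["permutation"]) == list(range(meta["ndim"]))

# Logical shape of the example tensor with physical shape [100, 200, 500]:
physical = [100, 200, 500]
logical = [physical[p] for p in meta["permutation"]]
print(logical)  # [500, 100, 200]
```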