Hey All,

We have 4 binding +1 votes, no non-binding +1 votes, and no -1 votes, so the vote passes.
Thanks everyone for your work and participation on this!

As a follow up we will:
[ ] merge changes to the format (https://github.com/apache/arrow/pull/37166/files)
[ ] merge the C++ and Python implementation (https://github.com/apache/arrow/pull/38008)

Rok

On Mon, Oct 2, 2023 at 4:25 PM Rok Mihevc <rok.mih...@gmail.com> wrote:

> +1
> Thanks everyone for voting!
>
> I'd like to leave the vote open until Wednesday.
>
> Rok
>
> On Fri, Sep 29, 2023 at 8:58 PM Matt Topol <zotthewiz...@gmail.com> wrote:
>
>> +1
>>
>> Thanks for all the work here!
>>
>> On Fri, Sep 29, 2023 at 11:04 AM Dewey Dunnington
>> <de...@voltrondata.com.invalid> wrote:
>>
>> > +1! Thank you for iterating on this with all of us!
>> >
>> > On Fri, Sep 29, 2023 at 11:28 AM Alenka Frim
>> > <ale...@voltrondata.com.invalid> wrote:
>> > >
>> > > +1
>> > > Thanks for pushing this through!
>> > >
>> > > On Wed, Sep 27, 2023 at 2:44 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Following the discussion [1][2] I would like to propose a vote to add
>> > > > variable shape tensor canonical extension type language to
>> > > > CanonicalExtensions.rst [3] as written below.
>> > > > A draft C++ implementation and a Python wrapper can be seen here [2].
>> > > >
>> > > > The vote will be open for at least 72 hours.
>> > > >
>> > > > [ ] +1 Accept this proposal
>> > > > [ ] +0
>> > > > [ ] -1 Do not accept this proposal because...
>> > > >
>> > > > [1] https://lists.apache.org/thread/qc9qho0fg5ph1dns4hjq56hp4tj7rk1k
>> > > > [2] https://github.com/apache/arrow/pull/37166
>> > > > [3] https://github.com/apache/arrow/blob/main/docs/source/format/CanonicalExtensions.rst
>> > > >
>> > > >
>> > > > Variable shape tensor
>> > > > =====================
>> > > >
>> > > > * Extension name: `arrow.variable_shape_tensor`.
>> > > >
>> > > > * The storage type of the extension is: ``StructArray`` where the struct
>> > > >   is composed of **data** and **shape** fields describing a single
>> > > >   tensor per row:
>> > > >
>> > > >   * **data** is a ``List`` holding the elements of a single tensor.
>> > > >     The data type of the list elements is uniform across the entire column.
>> > > >   * **shape** is a ``FixedSizeList<uint32>[ndim]`` of the tensor shape, where
>> > > >     the size of the list ``ndim`` is equal to the number of dimensions of
>> > > >     the tensor.
>> > > >
>> > > > * Extension type parameters:
>> > > >
>> > > >   * **value_type** = the Arrow data type of individual tensor elements.
>> > > >
>> > > >   Optional parameters describing the logical layout:
>> > > >
>> > > >   * **dim_names** = explicit names of tensor dimensions
>> > > >     as an array. Its length should be equal to the length of ``shape``,
>> > > >     i.e. to the number of dimensions.
>> > > >
>> > > >     ``dim_names`` can be used if the dimensions have well-known
>> > > >     names and they map to the physical layout (row-major).
>> > > >
>> > > >   * **permutation** = indices of the desired ordering of the
>> > > >     original dimensions, defined as an array.
>> > > >
>> > > >     The indices contain a permutation of the values [0, 1, .., N-1], where
>> > > >     N is the number of dimensions. The permutation indicates which
>> > > >     dimension of the logical layout corresponds to which dimension of the
>> > > >     physical tensor (the i-th dimension of the logical view corresponds
>> > > >     to the dimension with number ``permutation[i]`` of the physical
>> > > >     tensor).
>> > > >
>> > > >     Permutation can be useful in case the logical order of
>> > > >     the tensor is a permutation of the physical order (row-major).
>> > > >
>> > > >     When the logical and physical layouts are equal, the permutation will
>> > > >     always be ``[0, 1, .., N-1]`` and can therefore be left out.
>> > > >
>> > > >   * **uniform_dimensions** = indices of dimensions whose sizes are
>> > > >     guaranteed to remain constant. Indices are a subset of all possible
>> > > >     dimension indices ([0, 1, .., N-1]).
>> > > >     The uniform dimensions must still be represented in the ``shape`` field,
>> > > >     and must always have the same value for all tensors in the array --
>> > > >     this allows code to interpret the tensor correctly without accounting
>> > > >     for uniform dimensions while still permitting optional optimizations
>> > > >     that take advantage of the uniformity. ``uniform_dimensions`` can be
>> > > >     left out, in which case it is assumed that all dimensions might be
>> > > >     variable.
>> > > >
>> > > >   * **uniform_shape** = shape of the dimensions that are guaranteed to stay
>> > > >     constant over all tensors in the array, with the shape of the ragged
>> > > >     dimensions set to 0.
>> > > >     An array containing a tensor with shape (2, 3, 4) and
>> > > >     ``uniform_dimensions`` (0, 2) would have ``uniform_shape`` (2, 0, 4).
>> > > >
>> > > > * Description of the serialization:
>> > > >
>> > > >   The metadata must be a valid JSON object that optionally includes
>> > > >   dimension names with key **"dim_names"**, ordering of
>> > > >   dimensions with key **"permutation"**, indices of dimensions whose sizes
>> > > >   are guaranteed to remain constant with key **"uniform_dimensions"**, and
>> > > >   the shape of those dimensions with key **"uniform_shape"**.
>> > > >   The minimal metadata is an empty JSON object.
>> > > >
>> > > >   - Example of minimal metadata:
>> > > >
>> > > >     ``{}``
>> > > >
>> > > >   - Example with ``dim_names`` metadata for NCHW ordered data:
>> > > >
>> > > >     ``{ "dim_names": ["C", "H", "W"] }``
>> > > >
>> > > >   - Example with ``uniform_dimensions`` metadata for a set of color images
>> > > >     with variable width (height and channels are constant, so dimensions
>> > > >     0 and 2 are uniform):
>> > > >
>> > > >     ``{ "dim_names": ["H", "W", "C"], "uniform_dimensions": [0, 2] }``
>> > > >
>> > > >   - Example of a permuted 3-dimensional tensor:
>> > > >
>> > > >     ``{ "permutation": [2, 0, 1] }``
>> > > >
>> > > >     Here the stored shape describes the physical layout; for an individual
>> > > >     tensor with physical shape [100, 200, 500], the shape of the logical
>> > > >     layout would be ``[500, 100, 200]``.
>> > > >
>> > > > .. note::
>> > > >
>> > > >   With the exception of ``permutation``, all other parameters and the
>> > > >   storage of VariableShapeTensor define the *physical* storage of the
>> > > >   tensor.
>> > > >
>> > > >   For example, consider a tensor with:
>> > > >     shape = [10, 20, 30]
>> > > >     dim_names = [x, y, z]
>> > > >     permutation = [2, 0, 1]
>> > > >
>> > > >   This means the logical tensor has names [z, x, y] and shape [30, 10, 20].
>> > > >
>> > > > Elements in a variable shape tensor extension array are stored
>> > > > in row-major/C-contiguous order.
>> > > >
>> > > >
>> > > > Rok
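[Editor's sketch] For readers who want to experiment with the proposed layout before the implementation in [2]/PR 38008 is merged, here is a minimal Python sketch that uses only generic, existing pyarrow APIs rather than a dedicated extension class (the field name "tensors" and all example values are illustrative assumptions; the eventual Python API may differ). It builds the struct storage described above and tags the field with the canonical extension name and JSON metadata:

import json
import pyarrow as pa

# Storage for two int32 tensors with physical shapes (2, 3, 4) and (2, 5, 4):
# dimensions 0 and 2 are uniform across the array, dimension 1 is ragged.
data = pa.array(
    [list(range(24)), list(range(40))],  # 2*3*4 and 2*5*4 elements, row-major
    type=pa.list_(pa.int32()),
)
shape = pa.array(
    [[2, 3, 4], [2, 5, 4]],
    type=pa.list_(pa.uint32(), 3),  # FixedSizeList<uint32>[ndim=3]
)
storage = pa.StructArray.from_arrays([data, shape], names=["data", "shape"])

# Serialization metadata as described in the proposal (illustrative values).
ext_metadata = json.dumps({
    "dim_names": ["H", "W", "C"],
    "uniform_dimensions": [0, 2],
    "uniform_shape": [2, 0, 4],
})

# Attach the canonical extension name and metadata to the field so that a
# reader that understands the extension can reconstruct the tensors.
field = pa.field(
    "tensors",
    storage.type,
    metadata={
        "ARROW:extension:name": "arrow.variable_shape_tensor",
        "ARROW:extension:metadata": ext_metadata,
    },
)
table = pa.table([storage], schema=pa.schema([field]))

Carrying the "ARROW:extension:name" and "ARROW:extension:metadata" keys in the field metadata is the standard way extension types round-trip through IPC, so a consumer that does not recognize the extension still sees a plain struct of data and shape.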