Thanks for all the input and feedback so far! I went ahead and added the uniform_shape parameter with the second notation option to the proposal [1].
More discussion could be valuable here, so I'd like to call the vote on Wednesday.

[1] https://github.com/apache/arrow/pull/37166/files/9c827a0ba54280f4695202e17e32902986c4f12f#diff-b54425cb176b53e51925c13a4d4e85cf7d03d4e1226e6d5bf4d7ae09923db8b3

Best,
Rok

On Sat, Sep 16, 2023 at 3:11 PM Rok Mihevc <[email protected]> wrote:
> I agree, the increased complexity is probably not worth the savings
> from keeping only the shapes of ragged dimensions.
> However, I still think it could be valuable to optionally store the
> uniform dimension sizes as metadata. So:
>
> **uniform_shape** = sizes of all contained tensors in their uniform
> dimensions.
>
> Given a series of 3-dimensional image tensors [H, W, C] with
> variable width (uniform_dimensions = [0, 2]):
>
> Notation option 1:
> **uniform_shape** = [400, 3]
>
> Notation option 2:
> **uniform_shape** = [400, 0, 3]
>
> Best,
> Rok
>
> On Sat, Sep 16, 2023 at 4:34 AM Jeremy Leibs <[email protected]> wrote:
> >
> > On Fri, Sep 15, 2023 at 8:32 PM Rok Mihevc <[email protected]> wrote:
> >
> > >
> > > How about also changing shape and adding uniform_shape like so:
> > > """
> > > **shape** is a ``FixedSizeList<uint32>[ndim_ragged]`` of the ragged
> > > shape of each tensor contained in ``data``, where the size of the
> > > list ``ndim_ragged`` is equal to the number of tensor dimensions
> > > minus the number of uniform dimensions.
> > > [..]
> > > **uniform_shape**
> > > Sizes of all contained tensors in their uniform dimensions.
> > > """
> > >
> > > This would make the shape array smaller (in width) the more uniform
> > > dimensions are provided. However, it would increase the complexity
> > > of the extension type a little bit.
> > >
> >
> > This trade-off doesn't seem worthwhile to me.
> > - The shape array will almost always be dramatically smaller than the
> > tensor data itself, so the space savings are unlikely to be meaningful
> > in practice.
> > - On the other hand, coding up the index-offset math for a sparsely
> > represented shape with implicitly interleaved uniform dimensions is
> > much more error-prone (and less efficient).
> > - Consider even answering a simple question like "What is the size of
> > dimension N?":
> >
> > If `shape` always contains all the dimensions, this is trivially
> > `shape[N]` (or `shape[permutations[N]]` if permutations was specified).
> >
> > On the other hand, if `shape` only contains the ragged/variable
> > dimensions, this lookup instead becomes something like:
> > ```
> > offset = count(uniform_dimensions < N)
> > shape[N - offset]
> > ```
> >
> > Maybe this doesn't seem too bad at first, but does everyone implement
> > this as count()? Does someone implement it as
> > `find_lower_bound(uniform_dimensions, N)`? Did they validate that
> > `uniform_dimensions` was specified as a sorted list?
> >
> > Now, for added risk of errors, consider how this interacts with
> > `permutations`... in my opinion there is way too much thinking required
> > to figure out whether the correct value is
> > `shape[permutations[N] - offset]` or `shape[permutations[N - offset]]`.
> >
> > Arrow design guidance typically skews heavily in favor of efficient,
> > deterministic access over maximally space-efficient representations.
> >
> > Best,
> > Jeremy
>
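To make the offset math under discussion concrete, here is a small Python sketch (hypothetical helper names, not part of any Arrow implementation) comparing the dimension-size lookup under the two representations, using the [H, W, C] example above with uniform_dimensions = [0, 2] and one tensor of shape [400, 640, 3]:

```python
# Hypothetical sketch comparing "size of dimension N" lookups for a
# variable-shape tensor. Names are illustrative only, not Arrow API.

def dim_size_full(shape, n, permutations=None):
    """Full shape stored per tensor: the lookup is trivial."""
    if permutations is not None:
        return shape[permutations[n]]
    return shape[n]

def dim_size_ragged(ragged_shape, uniform_dimensions, uniform_shape, n):
    """Only ragged dimensions stored per tensor; uniform sizes come
    from uniform_shape metadata (notation option 1 above).
    Assumes uniform_dimensions is sorted, as Jeremy's email notes."""
    if n in uniform_dimensions:
        # Index uniform_shape by position among the uniform dimensions.
        return uniform_shape[uniform_dimensions.index(n)]
    # The offset math from the quoted email: count uniform dims before n.
    offset = sum(1 for d in uniform_dimensions if d < n)
    return ragged_shape[n - offset]

# [H, W, C] image tensors with variable W, uniform_dimensions = [0, 2]:
full_shape = [400, 640, 3]   # full per-tensor shape
ragged_shape = [640]         # only the ragged dimension stored
print(dim_size_full(full_shape, 1))                          # 640
print(dim_size_ragged(ragged_shape, [0, 2], [400, 3], 1))    # 640
print(dim_size_ragged(ragged_shape, [0, 2], [400, 3], 0))    # 400
print(dim_size_ragged(ragged_shape, [0, 2], [400, 3], 2))    # 3
```

Both lookups return the same sizes, but the ragged-only variant needs the extra branch and offset computation on every access, which illustrates the trade-off Jeremy describes.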
