[GitHub] [arrow] jorisvandenbossche commented on pull request #33925: GH-33923: [Docs] Tensor canonical extension type specification

via GitHub Tue, 31 Jan 2023 08:38:58 -0800


jorisvandenbossche commented on PR #33925:
URL: https://github.com/apache/arrow/pull/33925#issuecomment-1410709198

@mariosasko Thanks for the feedback!

> In Datasets, we store fixed-size tensors as variable-length lists, which
is suboptimal as it also stores the offsets

I noticed that in the HuggingFace Datasets implementation, there is a
[comment](https://github.com/huggingface/datasets/blob/5b793dd8c43bf6e85f165238becb3c64f6cd3ed0/src/datasets/features/features.py#L659-L660)
about using variable instead of fixed size list array. That is resolved now,
and it's fine to use fixed-size lists?

---

Related to the `is_row_major` / `order` parameter, we were having some
discussion in zulip chat ([this public
stream](https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Canonical.20extension.20type.20for.20tensors/near/324882917)).
Summarizing that here: I wonder to what extent the `is_row_major` / `order`
keyword is parameter actually useful.

- I think none of the existing custom tensor extension type implementations
have this (eg neither of Ray and HuggingFace Datasets have this. @mariosasko do
you know if that has come up in HuggingFace?)
- The first dimension always needs to be the one that matches the length of
the array and have the biggest stride (to match our list array layout). So even
if you would store each individual tensor as column-major (F contiguous), the
full ndarray representing the full TensorArray wouldn't be F-contiguous.
- Not super familiar here, but I _think_ some of the ML frameworks like
tensorflow also only deal with row-major data anyway. @rok mentioned an example
of a different "channel last" layout in pytorch
(https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html). But
that specific example also isn't exactly column-major for each individual
tensor (after the first dimension), but something custom (strides of `(1, 96,
3)` for tensor shape of (3, 32, 32) in their example, so neither one of `(1024,
32, 1)` or `(1, 3, 96)`). So to support this use case, we would actually need a
`strides` parameter, and not a `order`/`is_row_major` parameter.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #33925: GH-33923: [Docs] Tensor canonical extension type specification

Reply via email to