jorisvandenbossche commented on PR #33925: URL: https://github.com/apache/arrow/pull/33925#issuecomment-1410709198
@mariosasko Thanks for the feedback! > In Datasets, we store fixed-size tensors as variable-length lists, which is suboptimal as it also stores the offsets I noticed that in the HuggingFace Datasets implementation, there is a [comment](https://github.com/huggingface/datasets/blob/5b793dd8c43bf6e85f165238becb3c64f6cd3ed0/src/datasets/features/features.py#L659-L660) about using variable instead of fixed size list array. That is resolved now, and it's fine to use fixed-size lists? --- Related to the `is_row_major` / `order` parameter, we were having some discussion in zulip chat ([this public stream](https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Canonical.20extension.20type.20for.20tensors/near/324882917)). Summarizing that here: I wonder to what extent the `is_row_major` / `order` keyword is parameter actually useful. - I think none of the existing custom tensor extension type implementations have this (eg neither of Ray and HuggingFace Datasets have this. @mariosasko do you know if that has come up in HuggingFace?) - The first dimension always needs to be the one that matches the length of the array and have the biggest stride (to match our list array layout). So even if you would store each individual tensor as column-major (F contiguous), the full ndarray representing the full TensorArray wouldn't be F-contiguous. - Not super familiar here, but I _think_ some of the ML frameworks like tensorflow also only deal with row-major data anyway. @rok mentioned an example of a different "channel last" layout in pytorch (https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html). But that specific example also isn't exactly column-major for each individual tensor (after the first dimension), but something custom (strides of `(1, 96, 3)` for tensor shape of (3, 32, 32) in their example, so neither one of `(1024, 32, 1)` or `(1, 3, 96)`). So to support this use case, we would actually need a `strides` parameter, and not a `order`/`is_row_major` parameter. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org