jorisvandenbossche commented on PR #33925:
URL: https://github.com/apache/arrow/pull/33925#issuecomment-1410709198

   @mariosasko Thanks for the feedback!
   
   > In Datasets, we store fixed-size tensors as variable-length lists, which 
is suboptimal as it also stores the offsets
   
   I noticed that in the HuggingFace Datasets implementation, there is a 
[comment](https://github.com/huggingface/datasets/blob/5b793dd8c43bf6e85f165238becb3c64f6cd3ed0/src/datasets/features/features.py#L659-L660)
 about using variable instead of fixed size list array. That is resolved now, 
and it's fine to use fixed-size lists? 
   
   ---
   
   Related to the `is_row_major` / `order` parameter, we were having some 
discussion in zulip chat ([this public 
stream](https://ursalabs.zulipchat.com/#narrow/stream/180245-dev/topic/Canonical.20extension.20type.20for.20tensors/near/324882917)).
 Summarizing that here: I wonder to what extent the `is_row_major` / `order` 
keyword is parameter actually useful. 
   
   - I think none of the existing custom tensor extension type implementations 
have this (eg neither of Ray and HuggingFace Datasets have this. @mariosasko do 
you know if that has come up in HuggingFace?)
   - The first dimension always needs to be the one that matches the length of 
the array and have the biggest stride (to match our list array layout). So even 
if you would store each individual tensor as column-major (F contiguous), the 
full ndarray representing the full TensorArray wouldn't be F-contiguous.
   - Not super familiar here, but I _think_ some of the ML frameworks like 
tensorflow also only deal with row-major data anyway. @rok mentioned an example 
of a different "channel last" layout in pytorch 
(https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html). But 
that specific example also isn't exactly column-major for each individual 
tensor (after the first dimension), but something custom (strides of `(1, 96, 
3)` for tensor shape of (3, 32, 32) in their example, so neither one of `(1024, 
32, 1)` or `(1, 3, 96)`). So to support this use case, we would actually need a 
`strides` parameter, and not a `order`/`is_row_major` parameter.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to