[ 
https://issues.apache.org/jira/browse/ARROW-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101489#comment-17101489
 ] 

Joris Van den Bossche edited comment on ARROW-8714 at 10/30/20, 12:28 PM:
--------------------------------------------------------------------------

I think a struct with one field with the actual values and one field keeping 
track of the shape of each tensor sounds good.

> The start offset of the data for the next tensor can be computed from the 
> shape of the previous one.

The field storing the values of the actual tensors will be a variable-size 
binary or list layout, I suppose. Since that is a normal Arrow array, its 
offsets buffer already gives you the start offset of each tensor (without 
needing to calculate it from the shapes of all previous ones), see 
https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout

As for variable-size binary vs. variable-size list layout: in the end both use 
essentially the same physical storage (an offsets buffer plus one contiguous 
buffer of values). But using a list array instead of a binary array might make 
it a bit easier to work with: the data type of the individual values is then 
already encoded in the list type, and e.g. in the Python API of pyarrow you 
can easily access the flat array of values of the ListArray as a single NumPy 
array (from which a part can be sliced and reshaped to get the tensor).



was (Author: jorisvandenbossche):
Is your idea to use a variable-length binary value for the tensors? Because I 
was thinking, if we use a Variable Size List layout for the tensors field, then 
that way you have easy access to the start index of a certain tensor 
(without n

> [C++] Add a Tensor logical value type with varying dimensions, implemented 
> using ExtensionType
> ----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-8714
>                 URL: https://issues.apache.org/jira/browse/ARROW-8714
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Format
>            Reporter: Christian Hudon
>            Priority: Major
>
> Support for tensors in Table, RecordBatch, etc., where each row is a tensor of 
> a different shape (e.g. images of different sizes) but of the same underlying 
> type (e.g. int32). Implemented as an ExtensionType, so there is no need to 
> change the format. 
> I don't foresee needing rows whose tensors have different numbers of 
> dimensions. If support for that falls out easily from the implementation of 
> the case where every row's tensor has the same number of dimensions, great. 
> If it adds a lot of complexity, that case would be postponed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)