[
https://issues.apache.org/jira/browse/ARROW-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101489#comment-17101489
]
Joris Van den Bossche edited comment on ARROW-8714 at 10/30/20, 12:28 PM:
--------------------------------------------------------------------------
I think a struct with one field with the actual values and one field keeping
track of the shape of each tensor sounds good.
> The start offset of the data for the next tensor can be computed from the
> shape of the previous one.
The field storing the values of the actual tensors will use a variable-size
binary or list layout, I suppose. That way, since this is a normal Arrow array,
you already have access to the start offset of each tensor (without needing to
calculate it from the shapes of all previous ones), see
https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout
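To illustrate the point about offsets, here is a minimal pure-Python sketch of
the idea (not Arrow code). The names `shapes`, `flat_values`, `offsets`,
`start_from_offsets` and `start_from_shapes` are made up for this example. It
mimics a variable-size layout by keeping an explicit offsets list next to the
flat values, and compares that with recomputing the start from the shapes of
all previous tensors.

```python
from math import prod

# Hypothetical shapes of three tensors stored back to back.
shapes = [(2, 3), (4, 1), (1, 5)]

# Build the flat values and the offsets list, mimicking a variable-size layout:
# offsets[i] is the start of tensor i, offsets[i + 1] is its end.
flat_values = []
offsets = [0]
for k, shape in enumerate(shapes):
    n = prod(shape)                  # number of elements in tensor k
    flat_values.extend([k] * n)      # fill tensor k with the value k
    offsets.append(offsets[-1] + n)


def start_from_offsets(i):
    """Direct lookup: one read, independent of i."""
    return offsets[i]


def start_from_shapes(i):
    """Recompute the start from the shapes of all previous tensors."""
    return sum(prod(s) for s in shapes[:i])


for i in range(len(shapes)):
    assert start_from_offsets(i) == start_from_shapes(i)
```

Both functions give the same answer, but the lookup in `offsets` does not need
to walk over the previous shapes.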
For variable-size binary vs. variable-size list layout, in the end both will
use the same physical storage. But using a list array instead of a binary array
might make it a bit easier to work with: the data type of the individual values
is then already encoded in the list type as well, and e.g. in the Python APIs
of pyarrow, you can easily access the flat array of values of the ListArray as
a single numpy array (from which a part can be sliced and reshaped to get the
tensor).
was (Author: jorisvandenbossche):
Is your idea to use a variable-length binary value for the tensors? Because I
was thinking, if we use a Variable Size List layout for the tensors field, then
that way you have easy access to the start index of a certain tensor
(without n
> [C++] Add a Tensor logical value type with varying dimensions, implemented
> using ExtensionType
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-8714
> URL: https://issues.apache.org/jira/browse/ARROW-8714
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Format
> Reporter: Christian Hudon
> Priority: Major
>
> Support for tensors in Table, RecordBatch, etc., where each row is a tensor
> of a different shape (e.g. images of different sizes), but of the same
> underlying type (e.g. int32). Implemented as an ExtensionType, so no need to
> change the format.
> I don't see a need for each row to be a tensor with a different number of
> dimensions, so if the implementation for that falls out easily from the use
> case with each row in the table having a tensor with the same number of
> dimensions, great. If it adds a lot of complexity, that case would be
> postponed.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)