[ https://issues.apache.org/jira/browse/ARROW-550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15863008#comment-15863008 ]
Philipp Moritz commented on ARROW-550:
--------------------------------------
Hey Wes,
Thanks for the note, we are really excited about this feature!
At a high level, we need the ability to serialize arbitrary numpy arrays. This
means:
1) Support for multiple dimensions (obviously)
2) Support for all numpy types
3) Support for numpy arrays that have more elements than can be indexed with an
int32_t. We currently cannot do this, but it is very important, and we'd be
interested in learning about your plans here.
So far we have been using
https://github.com/ray-project/ray/blob/master/src/numbuf/python/src/pynumbuf/adapters/numpy.cc
and
https://github.com/ray-project/ray/blob/master/src/numbuf/cpp/src/numbuf/dict.h
which unfortunately don't support 3).
4) Something that is not critical but might help performance is a way to
specify the layout attributes of the numpy array (row-major or column-major,
transposed or not). That would allow us to avoid reordering/copying in some
cases.
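
To make points 3) and 4) a bit more concrete, here is a rough Python sketch of
the metadata we have in mind. The function name and the dictionary keys are
made up for illustration and are not part of any existing Arrow or numbuf API;
only the numpy attributes (flags, strides, size) are real.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2**31 - 1


def describe_layout(arr):
    """Collect layout metadata for a numpy array (hypothetical field names)."""
    if arr.flags.c_contiguous:
        order = "C"      # row-major
    elif arr.flags.f_contiguous:
        order = "F"      # column-major, e.g. the transpose of a C-ordered array
    else:
        order = None     # general strided view: needs strides or a copy
    return {
        "dtype": arr.dtype.str,
        "shape": arr.shape,
        "strides": arr.strides,
        "order": order,
        # True when the element count does not fit in an int32_t index
        "needs_int64_index": arr.size > INT32_MAX,
    }
```

With an order flag like this, a transposed array could be written out as-is in
Fortran order instead of being copied into C order first.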
A related but somewhat separate problem is that we cannot currently use the
arrow builder classes without an additional copy, because they require buffers
to be aligned to 64 bytes and the memory backing a numpy array might not be.
Finding a solution for this would also be great!
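
For reference, the alignment issue can be checked from Python roughly as
follows. This is only a sketch: the helper names are hypothetical, and
aligned_copy performs exactly the extra copy we would like to avoid, using the
usual trick of over-allocating a byte buffer and slicing at an aligned offset.

```python
import numpy as np


def is_aligned(arr, alignment=64):
    """Return True if the array's data pointer is a multiple of `alignment`."""
    return arr.ctypes.data % alignment == 0


def aligned_copy(arr, alignment=64):
    """Copy `arr` into a new C-ordered array whose data starts on an aligned address."""
    nbytes = arr.nbytes
    # Over-allocate so that an aligned start address is guaranteed to exist.
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-raw.ctypes.data) % alignment
    out = raw[offset:offset + nbytes].view(arr.dtype).reshape(arr.shape)
    out[...] = arr
    return out
```

An array for which is_aligned already returns True could in principle be
handed over without copying; only the misaligned case needs aligned_copy.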
You might also want to look into the tensorflow tensors, see
https://github.com/tensorflow/tensorflow/blob/754048a0453a04a761e112ae5d99c149eb9910dd/tensorflow/core/framework/types.proto
https://github.com/tensorflow/tensorflow/blob/754048a0453a04a761e112ae5d99c149eb9910dd/tensorflow/core/framework/tensor.proto
They have some more types that make sense in the deep learning context.
-- Philipp.
> [Format] Add a TensorMessage type
> ---------------------------------
>
> Key: ARROW-550
> URL: https://issues.apache.org/jira/browse/ARROW-550
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Format
> Reporter: Wes McKinney
>
> Since all data message types at the moment are 1-dimensional, a "tensor"
> message will contain an array of dimensions and an order flag (C order vs.
> Fortran order) to enable data to be interpreted as multiple dimensions. This
> is similar to multidimensional arrays in APL or Fortran or MATLAB, ndarrays
> in NumPy, etc.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)