I'm interested in defining a language-agnostic sparse tensor format. I believe 
Apache Arrow is one of the suitable places to do this, so let me propose my 
idea here.

First of all, my investigation found no common memory layout for sparse tensor 
representations. This means some kind of conversion is needed to share sparse 
tensors among different systems, even when the data format is logically the 
same. This is the same situation as with dataframes, and it is the reason why 
I believe Apache Arrow is a suitable place.

There are many formats for representing a sparse tensor. Most of them are 
specialized for matrices, which have two dimensions. There are few formats for 
general sparse tensors with more than two dimensions.

I think the COO format is a suitable starting point because COO can handle any 
number of dimensions and many systems support it. In my investigation, the 
systems that support COO are SciPy, dask, pydata/sparse, TensorFlow, and PyTorch.
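
To make the COO idea concrete, here is a minimal pure-Python sketch of how a 
three-dimensional tensor reduces to a list of coordinates plus a list of 
values. The function name `to_coo` and the nested-list input are assumptions 
made for illustration only; this is not the layout defined in the pull request.

```python
def to_coo(dense):
    """Return (shape, indices, values) for a nested-list tensor of any rank.

    Hypothetical helper for illustration; assumes a rectangular, non-empty input.
    """
    # Derive the shape by descending along the first element of each level.
    shape = []
    level = dense
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0]

    indices, values = [], []

    def walk(node, prefix):
        # Recurse through the nesting, recording the coordinate of each non-zero.
        if isinstance(node, list):
            for i, child in enumerate(node):
                walk(child, prefix + (i,))
        elif node != 0:
            indices.append(prefix)
            values.append(node)

    walk(dense, ())
    return tuple(shape), indices, values


dense = [
    [[0, 0, 5], [0, 0, 0]],
    [[0, 7, 0], [0, 0, 9]],
]
shape, indices, values = to_coo(dense)
print(shape)    # (2, 2, 3)
print(indices)  # [(0, 0, 2), (1, 0, 1), (1, 1, 2)]
print(values)   # [5, 7, 9]
```

Each coordinate tuple has one entry per dimension, so the same representation 
works unchanged for tensors of any rank.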

Additionally, the CSR format for matrices may also be good to support from the 
start. The reason is that CSR is efficient for extracting row slices, which 
may be important for extracting samples from tidy data, and it is supported by 
SciPy, MXNet, and R's Matrix library.
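
The row-slicing property of CSR can be sketched in a few lines of pure Python. 
The names `to_csr` and `row_slice` are hypothetical helpers for this 
illustration, and the three-array layout (`indptr`, `indices`, `data`) follows 
the common CSR convention rather than the prototype in the pull request.

```python
def to_csr(dense):
    """Build CSR arrays (indptr, indices, data) from a list-of-lists matrix."""
    indptr, indices, data = [0], [], []
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                indices.append(col)
                data.append(value)
        # indptr[r + 1] marks where row r ends in indices/data.
        indptr.append(len(data))
    return indptr, indices, data


def row_slice(indptr, indices, data, start, stop):
    """Extract rows [start, stop) using contiguous slices of the CSR arrays."""
    lo, hi = indptr[start], indptr[stop]
    # Rebase the row pointers so the slice starts at offset 0.
    new_indptr = [p - lo for p in indptr[start:stop + 1]]
    return new_indptr, indices[lo:hi], data[lo:hi]


dense = [
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [4, 5, 0, 0],
]
indptr, indices, data = to_csr(dense)
print(indptr, indices, data)
# [0, 2, 3, 3, 5] [0, 3, 2, 0, 1] [1, 2, 3, 4, 5]

sub = row_slice(indptr, indices, data, 1, 3)
print(sub)
# ([0, 1, 1], [2], [3])
```

Because the non-zeros of consecutive rows are stored contiguously, a row slice 
only needs two lookups in `indptr` and one contiguous slice of each of the 
other arrays.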

I add my prototype definition of the SparseTensor format in this pull request. 
I designed this prototype format to be extensible so that we can support 
additional sparse formats. I think we will need to support at least one more 
sparse tensor format for more than two dimensions in addition to COO, so we 
will need this extensibility.

[ Full content available at: https://github.com/apache/arrow/pull/2546 ]
This message was relayed via gitbox.apache.org for [email protected]
