ollemartensson opened a new issue, #565:
URL: https://github.com/apache/arrow-julia/issues/565

   ### Overview
   
   The addition of Tensor support is valuable for positioning Arrow.jl as a 
great library for machine learning and scientific computing. These 
multi-dimensional data structures are currently unsupported but are defined 
within the Arrow format specification. The Arrow specification provides 
distinct formats for dense and sparse multi-dimensional arrays; this issue 
focuses on sparse multi-dimensional arrays.
   
   ### Sparse Tensors
   
   Sparse tensors provide an efficient storage format for data where a majority 
of the elements are zero. The Arrow format defines a
   [SparseTensor message type in its FlatBuffers 
schema](https://github.com/apache/arrow/blob/main/format/SparseTensor.fbs#L209),
 which supports several index formats to represent the locations of the 
non-zero values.
   * [Coordinate 
(COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_(COO)): This 
is the most straightforward sparse format. It explicitly stores the coordinates 
of every non-zero element.
     * Storage: It uses two primary buffers:
       1. data: A contiguous buffer containing the non-zero values themselves.
       2. 
[indicesBuffer](https://github.com/apache/arrow/blob/main/format/SparseTensor.fbs#L34):
 A buffer representing an N×M matrix, where N is the number of non-zero values 
and M is the number of dimensions of the tensor. Each row of this matrix is a 
tuple of coordinates for a corresponding value in the data buffer.
     * Use Case: COO is best when you're adding data points one by one, or when 
your data is scattered randomly without a clear pattern.
   * [Compressed Sparse Matrix 
(CSX)](https://www.researchgate.net/publication/221643637_CSX_An_Extended_Compression_Format_for_SpMV_on_Shared_Memory_Systems):
 This format is specialized for 2D tensors (matrices) and includes [Compressed 
Sparse Row 
(CSR)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_(CSR,_CRS_or_Yale_format))
 and [Compressed Sparse Column (CSC) 
](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_column_(CSC_or_CCS))variants.
 It achieves compression by not repeating row (for CSR) or column (for CSC) 
indices.
     * Storage: It uses three buffers:
       1. data: A buffer of the non-zero values.
       2. indicesBuffer: A buffer of indices for the uncompressed dimension 
(e.g., column indices for CSR).
       3. [indptrBuffer (Index Pointer 
Buffer)](https://github.com/apache/arrow/blob/main/format/SparseTensor.fbs#L87):
 A buffer of length (compressed_dimension_size + 1). It stores offsets into the 
indicesBuffer and data buffer. The non-zero elements for row i are located 
between indptr[i] and indptr[i+1].
     * Use Case: CSX is well suited for matrix operations such as matrix-vector 
multiplication.
   * [Compressed Sparse Fiber 
(CSF)](https://www.boristhebrave.com/2021/01/01/compressed-sparse-fibers-explained/):
 This is the most general and advanced format, extending the compression scheme 
of CSR/CSC to tensors of arbitrary dimension. It recursively compresses the 
tensor dimension by dimension.
      * Storage: Its structure is more complex and hierarchical:
         1. data: A buffer of the non-zero values.
         2. indicesBuffers: An array of buffers, one for each dimension of the 
tensor, storing the coordinate values at each level of the hierarchy.
         3. [indptrBuffer (Index Pointer 
Buffer)](https://github.com/apache/arrow/blob/main/format/SparseTensor.fbs#L87):
 An array of buffers, one for each level of the hierarchy (except the last), 
storing pointers that define the relationship between a coordinate in one 
dimension and its children coordinates in the next dimension.
     * Use Case: CSF provides excellent compression and performance for a wide 
range of tensor algebra operations on highly structured sparse data, but it is 
the most complex format to implement (see [Chou's SM 
thesis](https://stephenchou.net/files/chou-18-sm-thesis.pdf)).
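   To make the buffer layouts above concrete, here is a minimal sketch in plain 
Julia (no Arrow integration yet; all names are illustrative) that builds 
COO-style and CSR-style buffer contents from a small dense matrix. The 
`data`/`indices`/`indptr` names mirror the Arrow SparseTensor fields:
   
   ```julia
   # Illustrative only: compute COO and CSR buffer contents for a small matrix.
   
   function to_coo(A::AbstractMatrix)
       nz = findall(!iszero, A)                       # CartesianIndex of each non-zero
       indices = [Tuple(I)[d] for I in nz, d in 1:2]  # N×2 coordinate matrix
       data = [A[I] for I in nz]                      # corresponding values
       return indices, data
   end
   
   function to_csr(A::AbstractMatrix)
       nrows = size(A, 1)
       indptr = zeros(Int, nrows + 1)  # length = compressed_dimension_size + 1
       indices = Int[]                 # column indices (the uncompressed dimension)
       data = eltype(A)[]
       for i in 1:nrows
           for j in 1:size(A, 2)
               if !iszero(A[i, j])
                   push!(indices, j)
                   push!(data, A[i, j])
               end
           end
           indptr[i + 1] = length(data)  # offset past the last non-zero of row i
       end
       return indptr, indices, data
   end
   
   A = [0 5 0;
        7 0 0;
        0 0 9]
   
   coo_indices, coo_data = to_coo(A)
   indptr, csr_indices, csr_data = to_csr(A)
   
   # Non-zeros of row i live between indptr[i]+1 and indptr[i+1] (1-based Julia):
   row2 = csr_data[indptr[2]+1 : indptr[3]]   # → [7]
   ```
   
   Note how CSR stores only `nrows + 1` index-pointer entries instead of 
repeating the row index for every non-zero value, which is where the 
compression comes from.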
   
   ### Proposed Design
   New struct types could be defined that act as zero-copy views over the 
underlying Arrow memory buffers.
   An abstract supertype could be defined, with concrete implementations for 
each sparse format.
   
   ```julia
   abstract type AbstractSparseTensor{T, N} <: AbstractArray{T, N} end
   
   struct SparseTensorCOO{T, N} <: AbstractSparseTensor{T, N}
       indices::AbstractMatrix{<:Integer} # View over indicesBuffer
       data::AbstractVector{T}             # View over data buffer
       shape::NTuple{N, Int}
   end
   
   struct SparseTensorCSX{T} <: AbstractSparseTensor{T, 2}
       indptr::AbstractVector{<:Integer}   # View over indptrBuffer
       indices::AbstractVector{<:Integer}  # View over indicesBuffer
       data::AbstractVector{T}             # View over data buffer
       shape::NTuple{2, Int}
       compressed_axis::Symbol # :Row or :Column
   end
   ```
   
   This design allows the sparse tensor types to leverage Julia's rich 
AbstractArray interface for slicing and other operations while ensuring data is 
never copied from the Arrow buffers. Each of these structs would hold views 
(e.g., created with unsafe_wrap) into the raw memory buffers provided by the 
Arrow data, again ensuring a zero-copy representation.
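   As a sketch of how the AbstractArray interface might hook in, here is a 
minimal standalone version of `SparseTensorCOO` with `size` and `getindex` 
defined (concrete field types and a naive O(nnz) linear-scan lookup are used 
for simplicity; a real implementation would hold buffer views and 
binary-search canonically sorted indices):
   
   ```julia
   abstract type AbstractSparseTensor{T, N} <: AbstractArray{T, N} end
   
   struct SparseTensorCOO{T, N} <: AbstractSparseTensor{T, N}
       indices::Matrix{Int}   # nnz × N coordinate matrix (would view indicesBuffer)
       data::Vector{T}        # non-zero values (would view the data buffer)
       shape::NTuple{N, Int}
   end
   
   Base.size(t::SparseTensorCOO) = t.shape
   
   # Naive lookup: scan every stored coordinate tuple for a match.
   function Base.getindex(t::SparseTensorCOO{T, N}, I::Vararg{Int, N}) where {T, N}
       @boundscheck checkbounds(t, I...)
       for k in axes(t.indices, 1)
           if ntuple(d -> t.indices[k, d], N) == I
               return t.data[k]
           end
       end
       return zero(T)   # unstored entries are implicit zeros
   end
   
   # A 2×3 tensor with non-zeros at (1, 2) and (2, 3):
   t = SparseTensorCOO{Float64, 2}([1 2; 2 3], [5.0, 9.0], (2, 3))
   t[1, 2]     # → 5.0
   t[2, 1]     # → 0.0
   collect(t)  # materializes a dense copy via the AbstractArray fallbacks
   ```
   
   With just `size` and `getindex` defined, slicing, iteration, `show`, and 
broadcasting all work through Julia's generic AbstractArray fallbacks, though 
performance-critical operations would warrant specialized methods.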

