Thanks. I'm new to the Arrow community, so I was hoping to get feedback on whether any of these are controversial or subject to constraints that I'm likely not familiar with. Point 2 is probably the simplest, so I can start with that.
Point 3 isn't coherent as a PR concept, but it describes a potential audience whose relationship to Arrow I don't understand (it could be an explicit non-goal for all I know).

Wes McKinney <wesmck...@gmail.com> writes:

> hi Jed,
>
> Would you like to submit a pull request to propose the changes or
> additions you are describing?
>
> Thanks
> Wes
>
> On Sat, Mar 9, 2019 at 11:32 PM Jed Brown <j...@jedbrown.org> wrote:
>>
>> Wes asked me to bring this discussion here. I'm a developer of PETSc
>> and, with Arrow getting into the sparse representation space, I would
>> like for it to interoperate as well as possible.
>>
>> 1. Please guarantee support for 64-bit offsets and indices. The
>> current spec uses "long", which is 32-bit on some common platforms
>> (notably Windows). Specifying int64_t would extend that size support
>> to LLP64 architectures.
>>
>> Meanwhile, there is a significant performance benefit to using 32-bit
>> indices when sizes allow it. With 4-byte floats, a CSR format costs
>> 12 bytes per nonzero with int64 indices versus 8 bytes per nonzero
>> with int32. Sparse matrix operations are typically bandwidth-bound,
>> so this is a substantial performance impact.
>>
>> 2. This relates to sorting of indices for CSR. Many algorithms need
>> to operate on the upper- or lower-triangular part of a matrix, which
>> is much easier if the CSR column indices are sorted within each row.
>> Typically, one finds the diagonal (storing its location in each row,
>> if it exists). Given that the current spec says COO entries are
>> sorted, it would be simple to specify this for CSR as well.
>>
>> https://github.com/apache/arrow/blob/master/format/SparseTensor.fbs
>>
>> 3. This is more nebulous and relates to partitioned representations
>> of matrices, such as arise when using distributed memory or when
>> optimizing for locality with threads.
>> Some packages store "global" indices in the CSR representation (in
>> which case column indices can be larger than the specified sizes); I
>> discourage this except as an intermediate representation in
>> matrix-matrix computations. Others (including PETSc) store local
>> matrices plus a "local-to-global" translation of local indices to
>> the distributed setting. This may be a no-op for Arrow (no support,
>> but don't prevent it), but I'm curious whether Arrow has a
>> philosophy regarding collective use of distributed memory (e.g., via
>> MPI).
>>
>> Finally, there are many other matrix formats to at least be aware
>> of. These include blocked CSR (fewer indices to store), sliced
>> Ellpack-style formats (better vectorization for some sparse matrix
>> operations), and compressed row offsets for CSR (good if many rows
>> are empty). Conventions regarding diagonals and ways of expressing
>> parallel matrices vary by package, so using a different package
>> often requires some conversion (which can sometimes be done in
>> place). PETSc has interfaces to many such packages, so we have many
>> examples of such conversions.
>>
>> Thanks for reading. I'm happy to discuss further.