Hi,
This sounds fine in principle. I'll let others comment on the details.

Regards

Antoine.


On 19/08/2019 at 11:29, Kenta Murata wrote:
> Hi,
>
> I’d like to propose the following improvements to the sparse tensor
> format and implementation.
>
> (1) Making variable bit-width indices available
>
> The main purpose of the first part of the proposal is to make 32-bit
> indices available. This allows us to serialize scipy.sparse.csr_matrix
> objects etc. with 32-bit indices without converting the index arrays
> to 64-bit values. As Jed said in the previous discussion [1] on this
> ML, 32-bit indices have the advantage of a smaller memory footprint,
> so I strongly believe this change is necessary for the sparse tensor
> support in Apache Arrow. To do this, we need to add both a type field
> to each sparse index format and a stride field to the SparseCOOIndex
> format.
>
> (2) Adding a new COO format with separate row and column indices
>
> scipy.sparse.coo_matrix manages the row and column indices in
> separate numpy arrays. That is enough for representing a sparse
> matrix. On the other hand, to support sparse tensors of arbitrary
> rank, Arrow's SparseCOOIndex manages COO indices as one matrix.
> Hence we need to copy the indices to convert a
> scipy.sparse.coo_matrix to Arrow’s SparseTensor. Introducing a new
> COO format with separate row and column indices can resolve this
> issue.
>
> (3) Adding SparseCSCIndex
>
> The CSC format of sparse matrices has the advantage of faster scanning
> in the column direction, while the CSR format is faster in a row-wise
> scan. Because the strengths of CSC differ from those of CSR, I
> want to support CSC before releasing Arrow 1.0.
>
> There is a work-in-progress branch [2] for (1) above. I’d appreciate
> any comments or suggestions.
>
> [1]
> http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%[email protected]%3e
>
> [2] https://github.com/mrkn/arrow/tree/sparse_tensor_index_value_type
>
> Regards,
> Kenta Murata
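
To illustrate points (1) and (2) of the quoted proposal, here is a minimal sketch in plain Python using only the standard library. It is not Arrow's or SciPy's API: the variable names are hypothetical, and the single-matrix layout is assumed to be an (nnz x ndim) index matrix stored row-major. It shows why converting from separate row/column arrays to one index matrix needs a copy, and how widening the indices to 64 bits affects the index storage.

```python
from array import array

# A 3x4 sparse matrix with 4 non-zero entries, in COO form.
# Separate-array layout (as described for scipy.sparse.coo_matrix):
# row and column indices live in two independent buffers.
rows = array("i", [0, 0, 1, 2])   # "i" is a C int, 32-bit on common platforms
cols = array("i", [0, 3, 1, 2])
values = array("d", [1.0, 2.0, 3.0, 4.0])

# Single-matrix layout (assumption: nnz x ndim, row-major), so the
# coordinates of each entry are interleaved in one buffer. Building it
# from the separate arrays means copying every index.
coords = array("i")
for r, c in zip(rows, cols):
    coords.extend((r, c))

# Widening the same indices to 64-bit values ("q" is a signed 64-bit int).
coords64 = array("q", coords)

index_bytes_32 = len(coords) * coords.itemsize
index_bytes_64 = len(coords64) * coords64.itemsize
print(index_bytes_32, index_bytes_64)
```

Where a C int is 32 bits, the 64-bit buffer takes twice the space of the original index data, which is the memory-footprint argument made in (1).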
