Github user mengxr commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-35017848
@shivaram @srowen @giyengar Thanks for keeping the discussion going!
@shivaram The requirement is to add sparse data support to all existing
MLlib algorithms. The first decision we need to make is what interfaces to
provide for sparse data; the second is how to implement the sparse algorithms
internally: which package to use, or whether we should implement and maintain
our own.
1. We need a sequential-access sparse vector for gradient-based algorithms and
a random-access sparse vector for feature transformation and tree-based
algorithms (see the first sketch at the end of this comment). The input to
clustering/classification algorithms should be labeled sparse/dense vectors,
which are easy for users to provide. We could assemble local sparse matrix
blocks (CSR or CSC) if that improves performance, and later expose the
interface to advanced users, but this is out of the scope of this discussion.
For collaborative filtering, I believe the most convenient input format for
users is (i, j, x) triples (COO).
2 & 3. Yes, we should stick to native BLAS/LAPACK for level 2 & 3
operations. But if we stick to JBLAS, we have to wrap JBLAS's dense
vector/matrix in order to interact with sparse vectors and maintain the code.
However, if breeze manages to have very good performance and provides unified
interface for both dense and sparse linear algebra. I would certainly choose
breeze with netlib-java from JBLAS. The only reason I didn't use breeze in this
PR is the slow dense + generic operation, which might be fixed already. If
@dlwh plans to make a release in the near future, I'm happy to do a benchmark
with existing JBLAS implementation.