Github user mengxr commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-35017848
@shivaram @srowen @giyengar Thanks for keeping the discussion going!
@shivaram The requirement is to add sparse data support to all existing
MLlib algorithms. The first decision we need to make is what interfaces to
provide for sparse data; the second is how to implement the sparse algorithms
internally: which package to use, or whether we should implement and maintain
our own.
1. We need a sequential-access sparse vector for gradient-based algorithms and
a random-access sparse vector for feature transformation and tree-based
algorithms (see the first sketch at the end of this comment). The input to
clustering/classification algorithms should be labeled sparse/dense vectors,
which are easy for users to provide. We could assemble local sparse matrix
blocks (CSR or CSC) if that improves performance, and later expose the
interface to advanced users, but this is out of the scope of this discussion.
For collaborative filtering, I believe the most convenient input format for
users is (i, j, x) triples (COO).
2 & 3. Yes, we should stick to native BLAS/LAPACK for level 2 & 3
operations. But if we stick to JBLAS, we have to wrap JBLAS's dense
vector/matrix in order to interact with sparse vectors and maintain the code.
However, if breeze manages to have very good performance and provides unified
interface for both dense and sparse linear algebra. I would certainly choose
breeze with netlib-java from JBLAS. The only reason I didn't use breeze in this
PR is the slow dense + generic operation, which might be fixed already. If
@dlwh plans to make a release in the near future, I'm happy to do a benchmark
with existing JBLAS implementation.