Github user srowen commented on the pull request:

    https://github.com/apache/incubator-spark/pull/575#issuecomment-34692729
  
    The mahout-math implementation of vectors is encumbered with a few bad
design choices, Hadoop dependencies that aren't needed here, a dependence on
that old fork of the Colt code, and a few lingering bugs. From experience, I
would strongly recommend not using this code. You're having to use reflection
(!!) and wrappers to get it working -- that can't be OK for new code.
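
    To illustrate what I mean, here's a hypothetical sketch of the
reflection-plus-wrapper pattern I'm objecting to (the mahout-math DenseVector
class and zSum method are real; the wrapper itself is illustrative, not code
from this PR):

        // Hypothetical sketch of the reflection-plus-wrapper pattern, not
        // code from this PR. Assumes mahout-math is on the classpath.
        object MahoutViaReflection {
          def main(args: Array[String]): Unit = {
            // Bind to the class at runtime instead of at compile time
            val clazz = Class.forName("org.apache.mahout.math.DenseVector")
            val ctor  = clazz.getConstructor(classOf[Array[Double]])
            val values: AnyRef = Array(1.0, 2.0, 3.0)
            val vec = ctor.newInstance(values).asInstanceOf[AnyRef]

            // Every call goes through Method.invoke: boxing overhead, no
            // compile-time type checking, failures surface only at runtime
            val sum = clazz.getMethod("zSum").invoke(vec).asInstanceOf[Double]
            println(sum) // 6.0
          }
        }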
    
    MLLib is already using JBlas. JBlas doesn't have a sparse representation,
but this change makes things worse, since it brings in and uses a second dense
vector representation on top of it.
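
    For context, a minimal sketch of the JBlas representation in question
(assumes jblas on the classpath) -- a vector is just a one-column dense
DoubleMatrix:

        import org.jblas.DoubleMatrix

        object JblasDense {
          def main(args: Array[String]): Unit = {
            // jblas has only dense storage: a vector is a 3x1 DoubleMatrix
            val v = new DoubleMatrix(Array(1.0, 2.0, 3.0))
            val w = new DoubleMatrix(Array(4.0, 5.0, 6.0))
            println(v.dot(w)) // 32.0
            println(v.add(w)) // element-wise sum, also dense
          }
        }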
    
    I have used Commons Math successfully. Strangely, they're deprecating the
sparse representation, even though it's been perfectly fine for me. I'd still
recommend it.
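
    For the record, the class I mean is OpenMapRealVector in commons-math3;
a minimal sketch of sparse and dense sharing the one RealVector API:

        import org.apache.commons.math3.linear.ArrayRealVector
        import org.apache.commons.math3.linear.OpenMapRealVector
        import org.apache.commons.math3.linear.RealVector

        object CommonsMathSparse {
          def main(args: Array[String]): Unit = {
            val n = 10
            // Sparse: only the non-zero entries are stored (deprecated
            // upstream, but it works)
            val sparse = new OpenMapRealVector(n)
            sparse.setEntry(3, 1.5)
            sparse.setEntry(7, 2.0)

            // Dense and sparse implement the same RealVector interface
            val dense: RealVector = new ArrayRealVector(Array.fill(n)(1.0))
            println(sparse.dotProduct(dense)) // 3.5
          }
        }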
    
    In the past I have used Commons Math in order to have one unified API for 
sparse/dense, and then translated to JBlas in key cases for speed. (Using JBlas 
everywhere might not be a great idea, actually.) I'd recommend that road if 
we're bothering to overhaul this.
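
    Concretely, that road looks something like this hypothetical helper
(toJblas and fastDot are illustrative names, not existing API; assumes
commons-math3 and jblas):

        import org.apache.commons.math3.linear.RealVector
        import org.jblas.DoubleMatrix

        object HotPath {
          // Keep the unified RealVector API everywhere; drop to jblas only
          // at the few dense call sites that dominate runtime
          def toJblas(v: RealVector): DoubleMatrix =
            new DoubleMatrix(v.toArray)

          def fastDot(a: RealVector, b: RealVector): Double =
            toJblas(a).dot(toJblas(b))
        }

    The conversion costs a dense copy each time, which is why it only pays
off on hot, dense paths.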
    
    Failing that, if MLLib is going to use JBlas everywhere it can, it should
stick to JBlas for all dense vectors and matrices, and something else is
needed for sparse. I still recommend Commons Math, or some derivative of it.
There's always the possibility of writing a JBlas-like sparse API, which
would be tidy and consistent, but it would mean reinventing the wheel again.
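
    For scale, reinventing that wheel might start from something like this
hypothetical minimal sparse vector (sorted indices plus parallel values):

        // Hypothetical sketch only, not a proposal: sorted indices plus
        // parallel values, with a dot product over just the stored entries
        class SparseVec(val size: Int,
                        val indices: Array[Int],
                        val values: Array[Double]) {
          require(indices.length == values.length)

          def apply(i: Int): Double = {
            val pos = java.util.Arrays.binarySearch(indices, i)
            if (pos >= 0) values(pos) else 0.0
          }

          def dot(dense: Array[Double]): Double = {
            var sum = 0.0
            var k = 0
            while (k < indices.length) {
              sum += values(k) * dense(indices(k))
              k += 1
            }
            sum
          }
        }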
    
    (Mahout does not compile against Hadoop 2 unless you change the build
profile -- this is why you need the CDH4.5-mr1 artifacts, or need to use the
Hadoop 2 profile.)

