Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35449886

@fommil @MLnick I included MTJ in the benchmarks (see the updated comment above). Basically, it performs very similarly to breeze.

@martinjaggi Gradient-based methods need the dot product between sparse and dense vectors, or the multiplication of a sparse matrix with dense vectors if we consider first assembling a local sparse matrix. If the input RDD to a gradient-based method is not cached, I would recommend caching it first, or down-sampling it if it is too large to cache. If serialization of the input data occurs on every iteration, the computation cost becomes negligible by comparison. If the data is cached and we don't copy data around during the conversion between the data model we defined and the underlying vector implementation, the overhead is very small.

I'm also working on a performance test suite for MLlib algorithms to make it easy for us to do such comparisons.
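To illustrate the kernel the comment refers to, here is a minimal Scala sketch of a dot product between a sparse vector and a dense vector. The parallel index/value arrays are a simplified, hypothetical stand-in for MLlib's or breeze's actual sparse vector classes, not the real API:

```scala
// Minimal sketch of the sparse-dense dot product used in a gradient step.
// A sparse vector is represented here as two parallel arrays: the indices
// of its nonzero entries and the corresponding values (an assumption for
// illustration, not MLlib's real data model).
object SparseDot {
  def dot(indices: Array[Int], values: Array[Double], dense: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    // A while loop keeps this hot path free of boxing and closure overhead.
    while (i < indices.length) {
      sum += values(i) * dense(indices(i))
      i += 1
    }
    sum
  }
}
```

For example, the sparse vector with nonzeros {0 -> 1.0, 2 -> 3.0} dotted with the dense vector [2.0, 5.0, 4.0] gives 1.0 * 2.0 + 3.0 * 4.0 = 14.0. The cost is proportional to the number of nonzeros, which is why per-iteration serialization of an uncached RDD would dominate the computation itself.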