Github user mengxr commented on the pull request: https://github.com/apache/incubator-spark/pull/575#issuecomment-35449886

@fommil @MLnick I included MTJ in the benchmarks (see the updated comment above). Basically, it performs very similarly to breeze.

@martinjaggi Gradient-based methods need the dot product between sparse and dense vectors, or the multiplication of a sparse matrix with dense vectors if we consider first assembling a local sparse matrix. If the input RDD to a gradient-based method is not cached, I would recommend caching it first, or down-sampling it if it is too large to cache. If serialization of the input data occurs on every iteration, the computation cost becomes negligible by comparison. If the data is cached and we don't copy data around during the conversion between the data model we defined and the underlying vector implementation, the overhead is very small.

I'm also working on a performance test suite for MLlib algorithms to make it easy for us to do such comparisons.
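To illustrate the kernel the comment refers to, here is a minimal Scala sketch of a dot product between a sparse vector and a dense vector. The parallel index/value arrays are a simplified, hypothetical stand-in for MLlib's or breeze's actual sparse vector classes, not the real API:

```scala
// Minimal sketch of the sparse-dense dot product used in a gradient step.
// A sparse vector is represented here as two parallel arrays: the indices
// of its nonzero entries and the corresponding values (an assumption for
// illustration, not MLlib's real data model).
object SparseDot {
  def dot(indices: Array[Int], values: Array[Double], dense: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    // A while loop keeps this hot path free of boxing and closure overhead.
    while (i < indices.length) {
      sum += values(i) * dense(indices(i))
      i += 1
    }
    sum
  }
}
```

For example, the sparse vector with nonzeros {0 -> 1.0, 2 -> 3.0} dotted with the dense vector [2.0, 5.0, 4.0] gives 1.0 * 2.0 + 3.0 * 4.0 = 14.0. The cost is proportional to the number of nonzeros, which is why per-iteration serialization of an uncached RDD would dominate the computation itself.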