Github user fommil commented on the pull request:

https://github.com/apache/incubator-spark/pull/575#issuecomment-35216981

@martinjaggi I believe you would be making a massive mistake by agreeing on a single serialisation scheme for sparse vectors, unless that format is independent of the storage structure (e.g. MatrixMarket), or unless you're happy to pass implementation details of the sparse format along with the structure.

BTW, the generic problem of building a sparse matrix from a data stream is far from solved... some structures are just intrinsically slow to construct, and others require up-front hints to avoid waste during construction.

Most of the battle when doing sparse matrix calculations is choosing, or writing, an appropriate data storage format that best suits the problem. Having a toolkit of generic sparse storage formats is good, but orders-of-magnitude improvements can be obtained with a custom-made format. What works for you might be really bad for somebody else. There is no silver bullet sparse matrix format.

MTJ has compressed row, compressed column and compressed diagonal formats (with some custom sparse structures for special matrices), and a special doubly linked structure for unstructured sparse matrices. Insertion is usually slow for structures that are optimised for calculations (see the first sketch at the end of this comment). I recently created a compressed column structure holding a hand-crafted primitive linked list... optimised for insertion speed, but potentially duplicating entries, and definitely not optimal for row traversal (although column traversal would be reasonable); the second sketch below shows the general idea. If insertion is to be performed concurrently, that again requires more thought.

I don't know what you mean by sequential access, because that depends on which ordering you're traversing in: by row, column, diagonal, arrowhead?

To optimise calculations, you want the JIT to be able to treat everything as close as possible to aligned arrays, to help the CPU caches (memory access is by far the biggest bottleneck in linear algebra). Exotic data structures can look great on paper, but the JIT just gets confused and all of a sudden you're into RAM access.

Btw, 1% fill (non-zero entries) is actually quite dense when you're talking about really large matrices. Whatever benchmarks you come up with will probably be highly biased towards the kinds of problems you're solving. Creating a representative suite of benchmarks would be quite the undertaking!

Basically, I'm saying that Spark should remain flexible for future sparse formats, and shouldn't pick one at this early stage just because it works well for logistic regression.
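
To make the calculation-vs-insertion trade-off concrete, here is a minimal sketch of compressed sparse column (CSC) storage. The class and method names are illustrative only, not MTJ's or Spark's API:

```scala
// Minimal CSC sketch; names are illustrative, not the MTJ or Spark API.
final class CscMatrix(
    val numRows: Int,
    val numCols: Int,
    val colPtrs: Array[Int],    // length numCols + 1; column j occupies colPtrs(j) until colPtrs(j + 1)
    val rowIndices: Array[Int], // row index of each stored entry (length nnz)
    val values: Array[Double]   // stored entries (length nnz)
) {
  // Column traversal is a linear scan over contiguous arrays: cache friendly,
  // and simple enough for the JIT to keep out of trouble.
  def foreachInColumn(j: Int)(f: (Int, Double) => Unit): Unit = {
    var k = colPtrs(j)
    while (k < colPtrs(j + 1)) {
      f(rowIndices(k), values(k))
      k += 1
    }
  }

  // Row traversal has to probe every column, and inserting a single entry
  // means shifting everything after it in rowIndices/values and bumping
  // every later column pointer: O(nnz) per insert. Great for calculation,
  // terrible for incremental construction.
}

// The 3x3 matrix [[1,0,0],[0,0,2],[0,3,0]] in CSC form:
// val m = new CscMatrix(3, 3, Array(0, 1, 2, 3), Array(0, 2, 1), Array(1.0, 3.0, 2.0))
```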
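
And a sketch of the sort of insertion-friendly column structure I mean: primitive parallel arrays emulating per-column linked lists, so appends are O(1) at the cost of possible duplicate entries. Again illustrative of the approach, not my actual implementation:

```scala
// Insertion-optimised column structure: one linked list per column, backed
// by primitive parallel arrays to avoid per-node object headers.
final class LinkedColumnMatrix(numRows: Int, numCols: Int) {
  // Entry k holds (rows(k), values(k)) and points at the next entry in its
  // column via next(k); -1 terminates a list.
  private var rows = new Array[Int](16)
  private var values = new Array[Double](16)
  private var next = new Array[Int](16)
  private val heads = Array.fill(numCols)(-1) // head entry of each column
  private var size = 0

  def add(i: Int, j: Int, v: Double): Unit = {
    if (size == rows.length) grow()
    rows(size) = i
    values(size) = v
    next(size) = heads(j) // prepend: O(1), but duplicate (i, j) entries are possible
    heads(j) = size
    size += 1
  }

  private def grow(): Unit = {
    rows = java.util.Arrays.copyOf(rows, size * 2)
    values = java.util.Arrays.copyOf(values, size * 2)
    next = java.util.Arrays.copyOf(next, size * 2)
  }

  // Column traversal follows the list; row traversal would have to walk
  // every column, which is why this layout is poor for row access.
  def foreachInColumn(j: Int)(f: (Int, Double) => Unit): Unit = {
    var k = heads(j)
    while (k >= 0) {
      f(rows(k), values(k))
      k = next(k)
    }
  }
}
```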