Github user fommil commented on the pull request:

https://github.com/apache/incubator-spark/pull/575#issuecomment-35216981

@martinjaggi I believe you would be making a massive mistake by agreeing on a single serialisation scheme for sparse vectors, unless that format is independent of the storage structure (e.g. MatrixMarket), or unless you're happy to pass implementation details of the sparse format along with the structure.

BTW, the generic problem of building a sparse matrix from a data stream is far from solved... some structures are just intrinsically slow to construct, and others require up-front hints to avoid waste during construction.

Most of the battle when doing sparse matrix calculations is choosing, or writing, an appropriate data storage format that best suits the problem. Having a toolkit of generic sparse storage formats is good, but orders-of-magnitude improvements can be obtained with a custom-made format. What works for you might be really bad for somebody else. There is no silver bullet sparse matrix format.

MTJ has compressed row, compressed column and compressed diagonal formats (with some custom sparse structures for special matrices), and a special doubly linked structure for unstructured sparse matrices. Insertion is usually slow for structures that are optimised for calculations (see the first sketch at the end of this comment). I recently created a compressed column structure holding a hand-crafted primitive linked list... optimised for insertion speed, but potentially duplicating entries, and definitely not optimal for row traversal (although column traversal would be reasonable); the second sketch below shows the general idea. If insertion is to be performed concurrently, that again requires more thought.

I don't know what you mean by sequential access, because that depends on which ordering you're traversing in: by row, column, diagonal, arrowhead?

To optimise calculations, you want the JIT to be able to treat everything as close as possible to aligned arrays, to help the CPU caches (memory access is by far the biggest bottleneck in linear algebra). Exotic data structures can look great on paper, but the JIT just gets confused and all of a sudden you're into RAM access.

Btw, 1% fill (non-zero entries) is actually quite dense when you're talking about really large matrices. Whatever benchmarks you come up with will probably be highly biased towards the kinds of problems you're solving. Creating a representative suite of benchmarks would be quite the undertaking!

Basically, I'm saying that Spark should remain flexible for future sparse formats, and shouldn't pick one at this early stage just because it works well for logistic regression.
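
To make the calculation-vs-insertion trade-off concrete, here is a minimal sketch of compressed sparse column (CSC) storage. The class and method names are illustrative only, not MTJ's or Spark's API:

```scala
// Minimal CSC sketch; names are illustrative, not the MTJ or Spark API.
final class CscMatrix(
    val numRows: Int,
    val numCols: Int,
    val colPtrs: Array[Int],    // length numCols + 1; column j occupies colPtrs(j) until colPtrs(j + 1)
    val rowIndices: Array[Int], // row index of each stored entry (length nnz)
    val values: Array[Double]   // stored entries (length nnz)
) {
  // Column traversal is a linear scan over contiguous arrays: cache friendly,
  // and simple enough for the JIT to keep out of trouble.
  def foreachInColumn(j: Int)(f: (Int, Double) => Unit): Unit = {
    var k = colPtrs(j)
    while (k < colPtrs(j + 1)) {
      f(rowIndices(k), values(k))
      k += 1
    }
  }

  // Row traversal has to probe every column, and inserting a single entry
  // means shifting everything after it in rowIndices/values and bumping
  // every later column pointer: O(nnz) per insert. Great for calculation,
  // terrible for incremental construction.
}

// The 3x3 matrix [[1,0,0],[0,0,2],[0,3,0]] in CSC form:
// val m = new CscMatrix(3, 3, Array(0, 1, 2, 3), Array(0, 2, 1), Array(1.0, 3.0, 2.0))
```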
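
And a sketch of the sort of insertion-friendly column structure I mean: primitive parallel arrays emulating per-column linked lists, so appends are O(1) at the cost of possible duplicate entries. Again illustrative of the approach, not my actual implementation:

```scala
// Insertion-optimised column structure: one linked list per column, backed
// by primitive parallel arrays to avoid per-node object headers.
final class LinkedColumnMatrix(numRows: Int, numCols: Int) {
  // Entry k holds (rows(k), values(k)) and points at the next entry in its
  // column via next(k); -1 terminates a list.
  private var rows = new Array[Int](16)
  private var values = new Array[Double](16)
  private var next = new Array[Int](16)
  private val heads = Array.fill(numCols)(-1) // head entry of each column
  private var size = 0

  def add(i: Int, j: Int, v: Double): Unit = {
    if (size == rows.length) grow()
    rows(size) = i
    values(size) = v
    next(size) = heads(j) // prepend: O(1), but duplicate (i, j) entries are possible
    heads(j) = size
    size += 1
  }

  private def grow(): Unit = {
    rows = java.util.Arrays.copyOf(rows, size * 2)
    values = java.util.Arrays.copyOf(values, size * 2)
    next = java.util.Arrays.copyOf(next, size * 2)
  }

  // Column traversal follows the list; row traversal would have to walk
  // every column, which is why this layout is poor for row access.
  def foreachInColumn(j: Int)(f: (Int, Double) => Unit): Unit = {
    var k = heads(j)
    while (k >= 0) {
      f(rows(k), values(k))
      k = next(k)
    }
  }
}
```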