[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

itg-abby Fri, 04 Nov 2016 16:54:08 -0700

Github user itg-abby commented on the issue:

    https://github.com/apache/spark/pull/15496
  
    @holdenk 
    I ran a quick benchmark on my MacBook to get the approximate construction 
time (in minutes) of a single CSR matrix on a large case, and compared it with 
the LIL and DOK types.
    
    2mil x 2mil Elements:
    CSR,      1.157418
    DOK,      5.531311
    LIL,      0.839315
    LIL2CSR,  1.650390
    
    Snippet used for benchmarking:
    ```
    import timeit

    runs = 100  # each statement is timed over 100 runs and averaged

    # CSR built from (data, indices, indptr): one row of 2,000,000 stored entries
    setup_small1 = 'from scipy.sparse import csr_matrix'
    method1 = 'csr_matrix((range(0, 2000000), range(0, 2000000), [0, 2000000]))'
    # DOK built from a pair of ranges (array-like input, not a shape tuple)
    setup_small2 = 'from scipy.sparse import dok_matrix'
    method2 = 'dok_matrix((range(0, 2000000), range(0, 2000000)))'
    # Empty LIL matrix of shape 2,000,000 x 2,000,000
    setup_small3 = 'from scipy.sparse import lil_matrix'
    method3 = 'lil_matrix((2000000, 2000000))'
    # Same empty LIL matrix, then converted to CSR
    setup_small4 = 'from scipy.sparse import lil_matrix'
    method4 = 'lil_matrix((2000000, 2000000)).tocsr()'

    t = timeit.Timer(method1, setup_small1)
    print('CSR construction: ' + str(t.timeit(runs) / float(runs)))
    t = timeit.Timer(method2, setup_small2)
    print('DOK construction: ' + str(t.timeit(runs) / float(runs)))
    t = timeit.Timer(method3, setup_small3)
    print('LIL construction: ' + str(t.timeit(runs) / float(runs)))
    t = timeit.Timer(method4, setup_small4)
    print('LIL construction + Convert to CSR: ' + str(t.timeit(runs) / float(runs)))
    ```
    
    I expect any change to the Sparse Vector structure to take place through 
the PySpark object class, so CSR will also outperform the LIL and DOK matrix 
types for function execution. This follows from the SciPy documentation, which 
lists:
    - Advantages of the CSR format: efficient arithmetic operations (CSR + CSR, 
CSR * CSR, etc.), efficient row slicing, fast matrix vector products.
    - Disadvantages of the CSR format: slow column slicing operations (consider 
CSC), changes to the sparsity structure are expensive (consider LIL or DOK).
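
    The trade-off quoted from the SciPy documentation can be seen in a small 
sketch. The matrix size and the entries below are made up for illustration and 
are not the benchmark above: the structure is built in LIL, converted once to 
CSR for arithmetic and row slicing, and a later structural change to the CSR 
matrix triggers SciPy's efficiency warning.

```python
import warnings

import numpy as np
from scipy.sparse import SparseEfficiencyWarning, lil_matrix

n = 1000  # hypothetical size, chosen only so the sketch runs quickly

# Build the structure incrementally in LIL, which supports cheap insertion.
lil = lil_matrix((n, n))
for i in range(n):
    lil[i, i] = i + 1.0          # diagonal entry
    lil[i, (i + 1) % n] = 1.0    # one off-diagonal entry per row

# Convert once to CSR for arithmetic.
csr = lil.tocsr()

v = np.ones(n)
product = csr.dot(v)      # matrix-vector product on the CSR matrix
first_rows = csr[:10]     # row slicing on the CSR matrix

# Adding a new stored entry to CSR works, but changes the sparsity
# structure, so SciPy emits a SparseEfficiencyWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    csr[0, 5] = 2.0

structure_warnings = [
    w for w in caught if issubclass(w.category, SparseEfficiencyWarning)
]
print("structure change warned:", len(structure_warnings) > 0)
```

    If the structure only ever changes on the PySpark side, the SciPy matrix 
can be rebuilt in CSR each time and this slow path is never hit.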
    
    
    Other items resolved:
    - The backslashes (\) have been removed.
    - The error message has been lengthened to mention SciPy not being 
available as a possible cause of a missing object attribute (the other cause 
being a call to something that does not exist in SciPy either).
    - I have only added my single test function to ML since I do not know the 
reason why all of the SciPy tests in MLlib are not in ML.
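
    As a rough illustration of the error message item above, here is a minimal 
sketch of attribute delegation with a two-cause message. The class name 
`SparseVectorSketch`, its fields, and the message wording are hypothetical and 
are not taken from the pull request or from PySpark.

```python
try:
    import scipy.sparse as sp
    _have_scipy = True
except ImportError:
    sp = None
    _have_scipy = False


class SparseVectorSketch(object):
    """Hypothetical stand-in for a sparse vector; not the PySpark class."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def _to_scipy(self):
        # 1 x size CSR row built from (data, indices, indptr).
        return sp.csr_matrix(
            (self.values, self.indices, [0, len(self.values)]),
            shape=(1, self.size))

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        if _have_scipy:
            target = self._to_scipy()
            if hasattr(target, name):
                return getattr(target, name)
        raise AttributeError(
            "'%s' object has no attribute '%s'. Either SciPy is not "
            "available, or SciPy's csr_matrix does not define it either."
            % (type(self).__name__, name))


vec = SparseVectorSketch(5, [1, 3], [2.0, 4.0])
total = vec.sum()  # not defined on the sketch, so it is delegated to SciPy

try:
    vec.no_such_method
except AttributeError as err:
    message = str(err)
    print(message)
```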




