Github user itg-abby commented on the issue:
https://github.com/apache/spark/pull/15496
@holdenk
I ran a quick benchmark on my MacBook to get the approximate construction
time of a single CSR on a large case (in seconds, as reported by timeit and
averaged over 100 runs), and compared it with the LIL and DOK types.
2mil x 2mil Elements:
CSR, 1.157418
DOK, 5.531311
LIL, 0.839315
LIL2CSR, 1.650390
Snippet used for benchmarking:
```
import timeit

# timeit returns the total time in seconds for all runs,
# so dividing by the run count gives the mean time per construction.
RUNS = 100

# CSR built directly from (data, indices, indptr)
setup_small1 = 'from scipy.sparse import csr_matrix;'
method1 = 'csr_matrix((range(0, 2000000),range(0, 2000000),[0, 2000000]))'
setup_small2 = 'from scipy.sparse import dok_matrix;'
method2 = 'dok_matrix((range(0, 2000000),range(0, 2000000)))'
# Empty LIL matrix of the given shape
setup_small3 = 'from scipy.sparse import lil_matrix;'
method3 = 'lil_matrix((2000000,2000000))'
# Empty LIL matrix, then converted to CSR
setup_small4 = 'from scipy.sparse import lil_matrix;'
method4 = 'lil_matrix((2000000,2000000)).tocsr()'

t = timeit.Timer(method1, setup_small1)
print('CSR construction: ' + str(t.timeit(RUNS) / float(RUNS)))
t = timeit.Timer(method2, setup_small2)
print('DOK construction: ' + str(t.timeit(RUNS) / float(RUNS)))
t = timeit.Timer(method3, setup_small3)
print('LIL construction: ' + str(t.timeit(RUNS) / float(RUNS)))
t = timeit.Timer(method4, setup_small4)
print('LIL construction + Convert to CSR: ' + str(t.timeit(RUNS) / float(RUNS)))
```
I expect any changes to the SparseVector structure to be made through the
PySpark object class, so CSR should also outperform the LIL and DOK matrix
types for function execution. This follows from the SciPy documentation
[Advantages of the CSR format: efficient arithmetic operations (CSR + CSR,
CSR * CSR, etc.), efficient row slicing, fast matrix-vector products.
Disadvantages of the CSR format: slow column slicing operations (consider
CSC), changes to the sparsity structure are expensive (consider LIL or DOK)].
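To make the trade-off above concrete, here is a minimal sketch (not code from the PR) that builds a single-row CSR from `(data, indices, indptr)` as in the benchmark, uses the operations CSR is good at, and routes a structure change through LIL before converting back. The size `n` and all variable names are assumptions chosen for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix, lil_matrix

n = 1000  # small illustrative size, not the 2,000,000 used in the benchmark

# Single-row CSR built directly from (data, indices, indptr).
data = np.arange(1, n + 1, dtype=float)
indices = np.arange(n)
indptr = np.array([0, n])
row = csr_matrix((data, indices, indptr), shape=(1, n))

# Operations the CSR format handles efficiently.
doubled = row + row            # CSR + CSR arithmetic
dot = row.dot(np.ones(n))      # matrix-vector product, 1-D result of length 1

# Changes to the sparsity structure: assign in LIL, then convert to CSR.
lil = lil_matrix((3, n))
lil[0, 5] = 1.0
lil[2, 7] = 2.0
converted = lil.tocsr()
```

The LIL step is only there for incremental assignment; once the structure is fixed, everything downstream works on the CSR result.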
Other items resolved:
- The \s have been removed
- The error message now also mentions that SciPy may not be available as a
possible cause of a missing object attribute (the other cause being a call to
something that does not exist in SciPy either).
- I have added only my single test function to ML, since I do not know why
the SciPy tests in MLlib are not all in ML.
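
For readers following along, the kind of error message described in the second item could be produced roughly as follows. This is a hypothetical sketch, not the code in the PR: the class `SparseVectorSketch`, its `_as_csr` helper, and the exact message wording are all assumptions made for illustration.

```python
import numpy as np

try:
    import scipy.sparse as sp
    _have_scipy = True
except ImportError:  # SciPy is treated as optional in this sketch
    sp = None
    _have_scipy = False


class SparseVectorSketch(object):
    """Hypothetical stand-in; this is not the PySpark SparseVector class."""

    def __init__(self, size, indices, values):
        self.size = int(size)
        self.indices = np.asarray(indices, dtype=np.int32)
        self.values = np.asarray(values, dtype=np.float64)

    def _as_csr(self):
        # Represent the vector as a single-row CSR matrix.
        indptr = np.array([0, len(self.indices)])
        return sp.csr_matrix((self.values, self.indices, indptr),
                             shape=(1, self.size))

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        # Never delegate our own fields or dunder names, to avoid recursion.
        if name.startswith('__') or name in ('size', 'indices', 'values'):
            raise AttributeError(name)
        if _have_scipy:
            csr = self._as_csr()
            if hasattr(csr, name):
                return getattr(csr, name)
        raise AttributeError(
            "'SparseVectorSketch' object has no attribute %r. Either SciPy is "
            "not available, or scipy.sparse.csr_matrix has no such attribute."
            % name)
```

With SciPy installed, `SparseVectorSketch(4, [1, 3], [2.0, 5.0]).sum()` is forwarded to the CSR matrix, while an unknown name raises the longer `AttributeError` naming both possible causes.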