bhavinthaker commented on a change in pull request #7921: Add three sparse 
tutorials
URL: https://github.com/apache/incubator-mxnet/pull/7921#discussion_r140671457
 
 

 ##########
 File path: docs/tutorials/sparse/csr.md
 ##########
 @@ -0,0 +1,338 @@
+
+# CSRNDArray - NDArray in Compressed Sparse Row Storage Format
+
+Many real world datasets deal with high dimensional sparse feature vectors. 
Take for instance a recommendation system where the number of categories and 
users is on the order of millions. The purchase data for each category by user 
would show that most users only make a few purchases, leading to a dataset with 
high sparsity (i.e. most of the elements are zeros).
+
+Storing and manipulating such large sparse matrices in the default dense 
structure results in wasted memory and processing on the zeros. To take 
advantage of the sparse structure of the matrix, the `CSRNDArray` in MXNet 
stores the matrix in [compressed sparse row 
(CSR)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29)
 format and uses specialized algorithms in operators.
+**The format is designed for 2D matrices with a large number of columns,
+and each row is sparse (i.e. with only a few nonzeros).**
+
+## Advantages of Compressed Sparse Row NDArray (CSRNDArray)
+For matrices of high sparsity (e.g. ~1% non-zeros), there are two primary 
advantages of `CSRNDArray` over the existing `NDArray`:
+
+- memory consumption is reduced significantly
+- certain operations are much faster (e.g. matrix-vector multiplication)
+
+You may be familiar with the CSR storage format in 
[SciPy](https://www.scipy.org/) and will note the similarities in MXNet's 
implementation. However there are some additional competitive features in 
`CSRNDArray` inherited from `NDArray`, such as lazy evaluation and automatic 
parallelization that are not available in SciPy's flavor of CSR.
+
 
 Review comment:
   It is possible that some newbie user may read the Sparse tutorial without 
having background on the terms. Hmm, do we want to define the terms here, even 
though one could argue that these are easy-to-understand terms?
   
   Lazy evaluation means that any operations on the CSRNDArray are not 
performed immediately, but are delayed until their evaluation is specifically 
requested. Automatic parallelization means that all operations are 
automatically executed in parallel with each other without requiring explicit 
mention.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to