[GitHub] bhavinthaker commented on a change in pull request #7921: Add three sparse tutorials

git Sun, 24 Sep 2017 19:45:13 -0700

bhavinthaker commented on a change in pull request #7921: Add three sparse 
tutorials
URL: https://github.com/apache/incubator-mxnet/pull/7921#discussion_r140671457

##########
File path: docs/tutorials/sparse/csr.md
##########
@@ -0,0 +1,338 @@
+
+# CSRNDArray - NDArray in Compressed Sparse Row Storage Format
+
+Many real world datasets deal with high dimensional sparse feature vectors.
Take for instance a recommendation system where the number of categories and
users is on the order of millions. The purchase data for each category by user
would show that most users only make a few purchases, leading to a dataset with
high sparsity (i.e. most of the elements are zeros).
+
+Storing and manipulating such large sparse matrices in the default dense
structure results in wasted memory and processing on the zeros. To take
advantage of the sparse structure of the matrix, the `CSRNDArray` in MXNet
stores the matrix in [compressed sparse row
(CSR)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29)
format and uses specialized algorithms in operators.
+**The format is designed for 2D matrices with a large number of columns,
+and each row is sparse (i.e. with only a few nonzeros).**
+
+## Advantages of Compressed Sparse Row NDArray (CSRNDArray)
+For matrices of high sparsity (e.g. ~1% non-zeros), there are two primary
advantages of `CSRNDArray` over the existing `NDArray`:
+
+- memory consumption is reduced significantly
+- certain operations are much faster (e.g. matrix-vector multiplication)
+
+You may be familiar with the CSR storage format in
[SciPy](https://www.scipy.org/) and will note the similarities in MXNet's
implementation. However there are some additional competitive features in
`CSRNDArray` inherited from `NDArray`, such as lazy evaluation and automatic
parallelization that are not available in SciPy's flavor of CSR.
+
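
For readers of the quoted hunk, the CSR layout it describes can be sketched in plain Python. This is a minimal illustration, not MXNet's `CSRNDArray` implementation: the helper names `dense_to_csr` and `csr_matvec` are hypothetical, and the array names `data`, `indices` and `indptr` follow the usual CSR convention rather than anything stated in the hunk.

```python
def dense_to_csr(dense):
    """Convert a list-of-lists matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # stored nonzero value
                indices.append(col)     # column of that value
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr


def csr_matvec(data, indices, indptr, vector):
    """Multiply a CSR matrix by a dense vector, visiting only stored nonzeros."""
    result = []
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        result.append(sum(data[k] * vector[indices[k]] for k in range(start, end)))
    return result


# Hypothetical 3x4 example matrix with four nonzeros.
dense = [
    [7, 0, 8, 0],
    [0, 0, 0, 0],
    [0, 9, 0, 6],
]
data, indices, indptr = dense_to_csr(dense)
product = csr_matvec(data, indices, indptr, [1, 2, 3, 4])
```

Only the four nonzeros and their column positions are stored, plus one offset per row, and the matrix-vector product loops over those entries alone. That is where the memory and speed advantages listed in the hunk come from when rows are very sparse.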

Review comment:
A new user may read the sparse tutorial without any background on the terms
"lazy evaluation" and "automatic parallelization". Do we want to define them
here, even though one could argue that they are easy to understand?

Lazy evaluation means that operations on the CSRNDArray are not performed
immediately, but are deferred until their result is explicitly requested.
Automatic parallelization means that operations which do not depend on one
another are executed in parallel automatically, without the user having to
request it.
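
If a definition alone feels too abstract for the tutorial, a toy example might help. The sketch below is plain Python and only illustrates the idea of deferring work until a result is requested. The `LazyValue` class is hypothetical and does not reflect how MXNet's engine is implemented.

```python
class LazyValue:
    """Hypothetical deferred computation; illustrative only, not MXNet's engine."""

    def __init__(self, name, fn, *inputs):
        self.name = name
        self.fn = fn
        self.inputs = inputs      # other LazyValue nodes this one depends on
        self._cached = None
        self._evaluated = False

    def evaluate(self, log):
        """Run this node and its inputs on first request, then reuse the result."""
        if not self._evaluated:
            args = [node.evaluate(log) for node in self.inputs]
            self._cached = self.fn(*args)
            self._evaluated = True
            log.append(self.name)  # record the order in which work actually ran
        return self._cached


log = []
a = LazyValue("a", lambda: 2)
b = LazyValue("b", lambda: 3)
c = LazyValue("c", lambda x, y: x * y, a, b)   # depends on a and b
unused = LazyValue("unused", lambda: 1 / 0)    # never requested, so never run

value = c.evaluate(log)  # work happens only here
```

Nothing runs when the nodes are created, and `unused` is never evaluated at all. The sketch evaluates `a` and `b` one after the other, but since neither depends on the other, a scheduler would be free to run them in parallel, which is the intuition behind automatic parallelization.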

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
