[GitHub] bhavinthaker commented on a change in pull request #7921: Add three sparse tutorials

git Sun, 24 Sep 2017 19:44:42 -0700

bhavinthaker commented on a change in pull request #7921: Add three sparse 
tutorials
URL: https://github.com/apache/incubator-mxnet/pull/7921#discussion_r140672649


 ##########
 File path: docs/tutorials/sparse/csr.md
 ##########
 @@ -0,0 +1,338 @@
+
+# CSRNDArray - NDArray in Compressed Sparse Row Storage Format
+
+Many real-world datasets deal with high-dimensional sparse feature vectors. 
Take for instance a recommendation system where the number of categories and 
users is on the order of millions. The purchase data for each category by user 
would show that most users only make a few purchases, leading to a dataset with 
high sparsity (i.e. most of the elements are zeros).
+
+Storing and manipulating such large sparse matrices in the default dense 
structure results in wasted memory and processing on the zeros. To take 
advantage of the sparse structure of the matrix, the `CSRNDArray` in MXNet 
stores the matrix in [compressed sparse row 
(CSR)](https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29)
 format and uses specialized algorithms in operators.
+**The format is designed for 2D matrices with a large number of columns,
+in which each row is sparse (i.e. has only a few nonzeros).**
+
+## Advantages of Compressed Sparse Row NDArray (CSRNDArray)
+For matrices of high sparsity (e.g. ~1% non-zeros), there are two primary 
advantages of `CSRNDArray` over the existing `NDArray`:
+
+- memory consumption is reduced significantly
+- certain operations are much faster (e.g. matrix-vector multiplication)
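
The memory saving can be estimated with simple arithmetic. The sketch below is illustrative only: the matrix size is hypothetical, and the element widths (4-byte float32 values, 8-byte int64 entries for the index arrays) are assumptions rather than figures from this tutorial.

```python
# Back-of-the-envelope storage comparison for a sparse matrix.
# Assumptions (not from the tutorial): 4-byte float32 values and
# 8-byte int64 entries for both `indices` and `indptr`.

def dense_bytes(rows, cols, value_bytes=4):
    """Bytes needed to store every element, zeros included."""
    return rows * cols * value_bytes


def csr_bytes(rows, nnz, value_bytes=4, index_bytes=8):
    """Bytes needed for the three CSR arrays: data, indices, indptr."""
    data = nnz * value_bytes            # one value per non-zero
    indices = nnz * index_bytes         # one column index per non-zero
    indptr = (rows + 1) * index_bytes   # one offset per row, plus one
    return data + indices + indptr


rows, cols = 10000, 10000    # hypothetical matrix size
nnz = rows * cols // 100     # about 1% non-zeros

dense = dense_bytes(rows, cols)
sparse = csr_bytes(rows, nnz)
print(dense, sparse, round(dense / sparse, 1))  # 400000000 12080008 33.1
```

Under these assumptions the CSR layout needs roughly 33 times less memory. The ratio shrinks as the matrix gets denser, because every non-zero costs an index as well as a value.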
+
+You may be familiar with the CSR storage format in 
[SciPy](https://www.scipy.org/) and will note the similarities in MXNet's 
implementation. However, `CSRNDArray` also inherits features from `NDArray`, 
such as lazy evaluation and automatic parallelization, that are not available 
in SciPy's flavor of CSR.
+
+The introduction of `CSRNDArray` also adds a new attribute to `NDArray`: 
`stype`, which holds the storage type. You can now query **ndarray.stype** 
in addition to the oft-queried attributes such as **ndarray.shape**, 
**ndarray.dtype**, and **ndarray.context**. For a typical dense NDArray, the 
value of `stype` is **"default"**. For a `CSRNDArray`, the value of `stype` is 
**"csr"**.
+
+## Prerequisites
+
+To complete this tutorial, you will need:
+
+- MXNet. See the instructions for your operating system in [Setup and 
Installation](http://mxnet.io/get_started/install.html)
+- [Jupyter](http://jupyter.org/)
+    ```
+    pip install jupyter
+    ```
+- Basic knowledge of NDArray in MXNet. See the detailed tutorial for NDArray 
in [NDArray - Imperative tensor operations on 
CPU/GPU](https://mxnet.incubator.apache.org/tutorials/basic/ndarray.html).
+- SciPy - A section of this tutorial uses the SciPy package in Python. If you 
don't have SciPy, the example in that section will be skipped.
+- GPUs - A section of this tutorial uses GPUs. If you don't have GPUs on your 
machine, simply set the variable `gpu_device` (set in the GPUs section of this 
tutorial) to `mx.cpu()`.
+
+## Compressed Sparse Row Matrix
+
+A CSRNDArray represents a 2D matrix as three separate 1D arrays: **data**, 
**indptr** and **indices**, where the column indices for row `i` are stored in 
`indices[indptr[i]:indptr[i+1]]` in ascending order, and their corresponding 
values are stored in `data[indptr[i]:indptr[i+1]]`.
+
+- **data**: CSR format data array of the matrix
+- **indices**: CSR format index array of the matrix
+- **indptr**: CSR format index pointer array of the matrix
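
The slicing rule above can be sketched in plain Python. This is a minimal illustration, not MXNet's implementation: the helper name `csr_row` and the small 2x4 matrix are hypothetical and chosen only to show how one row is recovered from the three arrays.

```python
def csr_row(data, indices, indptr, i, num_cols):
    """Rebuild dense row `i` from the three CSR arrays."""
    start, end = indptr[i], indptr[i + 1]
    row = [0] * num_cols
    # indices[start:end] holds the column of each stored value in row i,
    # and data[start:end] holds the values themselves.
    for col, value in zip(indices[start:end], data[start:end]):
        row[col] = value
    return row


# Hypothetical 2x4 matrix: [[0, 1, 0, 0],
#                           [2, 0, 0, 3]]
data = [1, 2, 3]
indices = [1, 0, 3]
indptr = [0, 1, 3]

rows = [csr_row(data, indices, indptr, i, 4) for i in range(2)]
print(rows)  # [[0, 1, 0, 0], [2, 0, 0, 3]]
```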
+
+### Example Matrix Compression
+
+For example, given the matrix:
+```
+[[7, 0, 8, 0]
+ [0, 0, 0, 0]
+ [0, 9, 0, 0]]
+```
+
+We can compress this matrix using CSR, and to do so we need to calculate 
`data`, `indices`, and `indptr`.
+
+The `data` array holds all the non-zero entries of the matrix in row-major 
order. Put another way, you create a data array that has all of the zeros 
removed from the matrix, row by row, storing the numbers in that order. Your 
result:
+
+    data = [7, 8, 9]
+
+The `indices` array stores the column index for each non-zero element in 
`data`. As you cycle through the data array, starting with 7, you can see it is 
in column 0. Then, looking at 8, you can see it is in column 2. Lastly, 9 is in 
column 1. Your result:
+
+    indices = [0, 2, 1]
+
+The `indptr` array identifies the row in which each value appears. Entry 
`indptr[i]` is the position in `data` where row `i` begins, so the array always 
starts with 0 and has one more entry than the matrix has rows. Each subsequent 
value `indptr[i+1]` is the cumulative number of non-zero elements in rows 0 
through `i`. The first row of the matrix has two non-zero values, so indptr[1] 
is 2. The next row contains all zeros, so the count is still 2 and indptr[2] is 
2. Finally, the last row contains one non-zero element, bringing the count to 
3, so indptr[3] is 3. Your result:
+
+    indptr = [0, 2, 2, 3]
+
+Note that in MXNet, the column indices for a given row are always sorted in 
ascending order,
+and duplicate column indices for the same row are not allowed.
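
The three arrays derived above can also be computed mechanically. The following is a plain-Python sketch of that procedure; the helper name `dense_to_csr` is hypothetical and is not part of MXNet. Because each row is scanned left to right, the column indices within a row come out in ascending order with no duplicates.

```python
def dense_to_csr(matrix):
    """Compress a dense 2D list into CSR (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in matrix:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # non-zero values in row-major order
                indices.append(col)  # column of each stored value
        # Running total of non-zeros after finishing this row.
        indptr.append(len(data))
    return data, indices, indptr


matrix = [[7, 0, 8, 0],
          [0, 0, 0, 0],
          [0, 9, 0, 0]]

data, indices, indptr = dense_to_csr(matrix)
print(data)     # [7, 8, 9]
print(indices)  # [0, 2, 1]
print(indptr)   # [0, 2, 2, 3]
```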
+
+## Array Creation
+
+There are a few different ways to create a `CSRNDArray`, but first let's 
recreate the matrix we just discussed using the `data`, `indices`, and `indptr` 
we calculated.
+
+* We can create a CSRNDArray with data, indices and indptr by using the 
`csr_matrix` function:
+
+
+```python
+import mxnet as mx
+# Create a CSRNDArray with python lists
+shape = (3, 4)
+data_list = [7, 8, 9]
+indptr_list = [0, 2, 2, 3]
+indices_list = [0, 2, 1]
+a = mx.nd.sparse.csr_matrix(data_list, indptr_list, indices_list, shape)
+# Inspect the matrix
+a.asnumpy()
+```
+
+
+```python
+import numpy as np
+# Create a CSRNDArray with numpy arrays
+data_np = np.array([7, 8, 9])
+indptr_np = np.array([0, 2, 2, 3])
+indices_np = np.array([0, 2, 1])
+b = mx.nd.sparse.csr_matrix(data_np, indptr_np, indices_np, shape)
+b.asnumpy()
+```
+
+
+```python
+# Compare the two
 
 Review comment:
   Suggested text: "Compare the two. They are exactly the same."
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
