[GitHub] [systemds] Baunsgaard commented on a change in pull request #1061: [SYSTEMDS-2669] Python Mnist LogReg Tutorial

GitBox Wed, 23 Sep 2020 05:44:03 -0700


Baunsgaard commented on a change in pull request #1061:
URL: https://github.com/apache/systemds/pull/1061#discussion_r493544389




##########
File path: src/main/python/docs/source/guide/algorithms.rst
##########
@@ -26,18 +26,177 @@ Prerequisite:
 
 - :doc:`/getting_started/install`
 
+This example walks through one algorithm from the list of builtin algorithms that can be applied to a dataset.
+For simplicity, the dataset used is `MNIST <http://yann.lecun.com/exdb/mnist/>`_,
+since it is well known and widely explored.
+
+To skip the explanation, see the full script at the bottom of this page.
 
 Step 1: Get Dataset
 -------------------
 
-TODO
+SystemDS provides a builtin for downloading and setting up the MNIST dataset.
+To set this up, simply use::
 
-Step 2: Train model
--------------------
+    from systemds.examples.tutorials.mnist import DataManager
+    d = DataManager()
+    X = d.get_train_data()
+    Y = d.get_train_labels()
+
+Here the DataManager contains the code for downloading and setting up numpy arrays containing the data.
+
+Step 2: Reshape & Format
+------------------------
+
+Data usually does not come in a format that perfectly fits an algorithm, so to make this tutorial more
+realistic, some preprocessing is required to make the input fit.
+
+First, the training data, X, has multiple dimensions, resulting in a shape of (60000, 28, 28).
+The dimensions correspond to the number of images, 60000, then the number of pixel rows, 28,
+and finally the number of pixel columns, 28.
+
+To use this data for Logistic Regression we have to reduce the number of dimensions.
+The algorithm takes the training data as its input X and
+requires it to have two dimensions: the first is the
+number of inputs, and the second is the number of features.
+
+Therefore, to make the data fit the algorithm, we reshape the X dataset like so::
+
+    X = X.reshape((60000, 28*28))
+
+This takes the rows of pixels in each image and appends them to one another, making a single feature vector of 784 values per image.
+
+The Y dataset also does not perfectly fit the Logistic Regression algorithm, because the labels
+for this dataset are values ranging from 0 to 9, where each label corresponds to the digit shown in the image.
+Unfortunately, the algorithm requires the labels to be distinct integers from 1 and upwards.
+
+Therefore, we add 1 to each label so that the labels go from 1 to 10, like this::
+
+    Y = Y + 1
+
+With these steps we are now ready to train a simple logistic model.
+
+Step 3: Training
+----------------
+
+To start with, we set up a SystemDS context::
+
+    from systemds.context import SystemDSContext
+    sds = SystemDSContext()

Review comment:
       should be resolved now.




