Repository: incubator-systemml

Updated Branches:
  refs/heads/master 33ebe969b -> 766cc48c0
[MINOR] Updating the MNIST LeNet example notebook.

This improves the nomenclature & documentation, adds more cleanup, removes
the unnecessary saving of the MNIST data to CSV files, and makes everything
more Pythonic.

Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/766cc48c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/766cc48c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/766cc48c

Branch: refs/heads/master
Commit: 766cc48c00adba37fde9a34a3a06f6df8708205d
Parents: 33ebe96
Author: Mike Dusenberry <[email protected]>
Authored: Fri Jun 2 18:37:06 2017 -0700
Committer: Mike Dusenberry <[email protected]>
Committed: Fri Jun 2 18:37:06 2017 -0700

----------------------------------------------------------------------
 .../Deep_Learning_Image_Classification.ipynb | 270 ++++++-------------
 1 file changed, 88 insertions(+), 182 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/766cc48c/samples/jupyter-notebooks/Deep_Learning_Image_Classification.ipynb
----------------------------------------------------------------------
diff --git a/samples/jupyter-notebooks/Deep_Learning_Image_Classification.ipynb b/samples/jupyter-notebooks/Deep_Learning_Image_Classification.ipynb
index 3124028..fa1e4b8 100644
--- a/samples/jupyter-notebooks/Deep_Learning_Image_Classification.ipynb
+++ b/samples/jupyter-notebooks/Deep_Learning_Image_Classification.ipynb
@@ -6,20 +6,18 @@
    "source": [
     "# Deep Learning Image Classification using Apache SystemML\n",
     "\n",
-    "This notebook shows SystemML Deep Learning functionality to map images of single digit numbers to their corresponding numeric representations. See [Getting Started with Deep Learning and Python](http://www.pyimagesearch.com/2014/09/22/getting-started-deep-learning-python/) for an explanation of the used deep learning concepts and assumptions.\n",
+    "This notebook demonstrates how to train a deep learning model on SystemML for the classic [MNIST](http://yann.lecun.com/exdb/mnist/) problem of mapping images of single digit numbers to their corresponding numeric representations, using a classic [LeNet](http://yann.lecun.com/exdb/lenet/)-like convolutional neural network model. See [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap6.html) for more information on neural networks and deep learning.\n",
     "\n",
-    "The downloaded MNIST dataset contains labeled images of handwritten digits, where each example is a 28x28 pixel image of grayscale values in the range [0,255] stretched out as 784 pixels, and each label is one of 10 possible digits in [0,9]. We download 60,000 training examples, and 10,000 test examples, where the format is \"label, pixel_1, pixel_2, ..., pixel_n\". We train a SystemML LeNet model. The results of the learning algorithms have an accuracy of 98 percent.\n",
+    "The downloaded MNIST dataset contains labeled images of handwritten digits, where each example is a 28x28 pixel image of grayscale values in the range [0,255] stretched out as 784 pixels, and each label is one of 10 possible digits in [0,9]. We download 60,000 training examples, and 10,000 test examples, where the images and labels are stored in separate matrices. We then train a SystemML LeNet-like convolutional neural network (i.e. \"convnet\", \"CNN\") model. The resulting trained model has an accuracy of 98.6% on the test dataset.\n",
     "\n",
-    "1. [Download and Access MNIST data](#access_data)\n",
+    "1. [Download the MNIST data](#download_data)\n",
     "1. [Train a CNN classifier for MNIST handwritten digits](#train)\n",
-    "1. [Detect handwritten Digits](#predict)\n"
+    "1. [Detect handwritten Digits](#predict)"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
+   "metadata": {},
    "source": [
     "<div style=\"text-align:center\" markdown=\"1\">\n",
     "\n",
@@ -31,14 +29,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### This notebook is supported with SystemML 0.14.0 and above."
+    "### Note: This notebook is supported with SystemML 0.14.0 and above."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "scrolled": false
+    "collapsed": true
    },
    "outputs": [],
    "source": [
@@ -51,61 +49,36 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from systemml import MLContext, dml\n",
-    "\n",
-    "ml = MLContext(sc)\n",
+    "%matplotlib inline\n",
     "\n",
-    "print (\"Spark Version:\" + sc.version)\n",
-    "print (\"SystemML Version:\" + ml.version())\n",
-    "print (\"SystemML Built-Time:\" + ml.buildTime())"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "import warnings\n",
-    "warnings.filterwarnings(\"ignore\")\n",
-    "from sklearn import datasets\n",
-    "from sklearn.cross_validation import train_test_split\n",
-    "from sklearn.metrics import classification_report\n",
-    "import pandas as pd\n",
-    "import numpy as np\n",
     "import matplotlib.pyplot as plt\n",
-    "#import matplotlib.image as mpimg\n",
-    "%matplotlib inline"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Create data directory."
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "from sklearn import datasets\n",
+    "from sklearn.cross_validation import train_test_split # module deprecated in 0.18\n",
+    "#from sklearn.model_selection import train_test_split # use this module for >=0.18\n",
+    "from sklearn import metrics\n",
+    "from systemml import MLContext, dml"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "collapsed": true
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
-    "%%sh\n",
-    "mkdir -p data/mnist/\n",
-    "cd data/mnist/"
+    "ml = MLContext(sc)\n",
+    "print(\"Spark Version: {}\".format(sc.version))\n",
+    "print(\"SystemML Version: {}\".format(ml.version()))\n",
+    "print(\"SystemML Built-Time: {}\".format(ml.buildTime()))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "<a id=\"access_data\"></a>\n",
-    "## Download and Access MNIST data\n",
+    "<a id=\"download_data\"></a>\n",
+    "## Download the MNIST data\n",
    "\n",
     "Download the [MNIST data from the MLData repository](http://mldata.org/repository/data/viewslug/mnist-original/), and then split and save."
    ]
@@ -113,87 +86,28 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
+   "metadata": {},
   "outputs": [],
   "source": [
    "mnist = datasets.fetch_mldata(\"MNIST Original\")\n",
    "\n",
-    "print (\"Mnist data features:\" + str(mnist.data.shape))\n",
-    "print (\"Mnist data label:\" + str(mnist.target.shape))\n",
-    "\n",
-    "trainX, testX, trainY, testY = train_test_split(mnist.data, mnist.target.astype(\"int0\"), test_size = 0.142857)\n",
-    "\n",
-    "trainD = np.concatenate((trainY.reshape(trainY.size, 1), trainX),axis=1)\n",
-    "testD = np.concatenate((testY.reshape (testY.size, 1), testX),axis=1)\n",
+    "print(\"MNIST data features: {}\".format(mnist.data.shape))\n",
+    "print(\"MNIST data labels: {}\".format(mnist.target.shape))\n",
    "\n",
-    "print (\"Images for training:\" + str(trainD.shape))\n",
-    "print (\"Images used for testing:\" + str(testD.shape))\n",
-    "pix = int(np.sqrt(trainD.shape[1]))\n",
-    "print (\"Each image is: \" + str(pix) + \" by \" + str(pix) + \" pixels\")\n",
+    "X_train, X_test, y_train, y_test = train_test_split(\n",
+    " mnist.data, mnist.target.astype(np.uint8).reshape(-1, 1),\n",
+    " test_size = 10000)\n",
    "\n",
-    "np.savetxt('data/mnist/mnist_train.csv', trainD, fmt='%u', delimiter=\",\")\n",
-    "np.savetxt('data/mnist/mnist_test.csv', testD, fmt='%u', delimiter=\",\")"
+    "print(\"Training images, labels: {}, {}\".format(X_train.shape, y_train.shape))\n",
+    "print(\"Testing images, labels: {}, {}\".format(X_test.shape, y_test.shape))\n",
+    "print(\"Each image is: {0:d}x{0:d} pixels\".format(int(np.sqrt(X_train.shape[1]))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Alternatively get the data from here. (Uncomment curl commands from following cell if you want to download using following approach)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "%%sh\n",
-    "cd data/mnist\n",
-    "# curl -O https://pjreddie.com/media/files/mnist_train.csv\n",
-    "# curl -O https://pjreddie.com/media/files/mnist_test.csv\n",
-    "wc -l mnist*"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Read the data."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "trainData = np.genfromtxt('data/mnist/mnist_train.csv', delimiter=\",\")\n",
-    "testData = np.genfromtxt('data/mnist/mnist_test.csv', delimiter=\",\")\n",
-    "\n",
-    "print (\"Training data: \" + str(trainData.shape))\n",
-    "print (\"Test data: \" + str(testData.shape))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "pd.set_option('display.max_columns', 200)\n",
-    "pd.DataFrame(testData[1:10,],dtype='uint')"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Following command is not required for code above SystemML 0.14 (master branch dated 05/15/2017 or later)"
+    "### Note: The following command is not required for code above SystemML 0.14 (master branch dated 05/15/2017 or later)."
   ]
  },
  {
@@ -210,7 +124,7 @@
   "metadata": {},
   "source": [
    "<a id=\"train\"></a>\n",
-    "## Develop LeNet CNN classifier on Training Data"
+    "## Train a LeNet-like CNN classifier on the training data"
   ]
  },
  {
@@ -227,37 +141,23 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "### Train Model using SystemML LeNet CNN."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "(on a Mac Book, this takes approx. 5-6 mins for 1 epoch)"
+    "### Train a LeNet-like CNN model using SystemML"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
-    "collapsed": true,
-    "scrolled": true
+    "collapsed": true
   },
   "outputs": [],
   "source": [
    "script = \"\"\"\n",
    " source(\"nn/examples/mnist_lenet.dml\") as mnist_lenet\n",
    "\n",
-    " # Bind training data\n",
-    " n = nrow(data)\n",
-    "\n",
-    " # Extract images and labels\n",
-    " images = data[,2:ncol(data)]\n",
-    " labels = data[,1]\n",
-    "\n",
    " # Scale images to [-1,1], and one-hot encode the labels\n",
-    " images = (images / 255.0) * 2 - 1\n",
+    " images = (images / 255) * 2 - 1\n",
+    " n = nrow(images)\n",
    " labels = table(seq(1, n), labels+1, n, 10)\n",
    "\n",
    " # Split into training (55,000 examples) and validation (5,000 examples)\n",
@@ -266,49 +166,56 @@
    " y = labels[5001:nrow(images),]\n",
    " y_val = labels[1:5000,]\n",
    "\n",
-    " # Train the model using channel, height, and width to produce weights/biases.\n",
+    " # Train the model to produce weights & biases.\n",
    " [W1, b1, W2, b2, W3, b3, W4, b4] = mnist_lenet::train(X, y, X_val, y_val, C, Hin, Win, epochs)\n",
    "\"\"\"\n",
-    "rets = ('W1', 'b1','W2','b2','W3','b3','W4','b4')\n",
+    "out = ('W1', 'b1', 'W2', 'b2', 'W3', 'b3', 'W4', 'b4')\n",
+    "prog = (dml(script).input(images=X_train, labels=y_train, epochs=1, C=1, Hin=28, Win=28)\n",
+    " .output(*out))\n",
    "\n",
-    "script = (dml(script).input(data=trainData, epochs=1, C=1, Hin=28, Win=28)\n",
-    " .output(*rets)) \n",
-    "\n",
-    "W1, b1, W2, b2, W3, b3, W4, b4 = (ml.execute(script).get(*rets))"
+    "W1, b1, W2, b2, W3, b3, W4, b4 = ml.execute(prog).get(*out)"
   ]
  },
  {
   "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
+   "metadata": {},
   "source": [
-    "Use trained model and predict on test data, and evaluate the quality of the predictions for each digit."
+    "Use the trained model to make predictions for the test data, and evaluate the quality of the predictions."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
    "outputs": [],
    "source": [
-    "scriptPredict = \"\"\"\n",
+    "script_predict = \"\"\"\n",
     " source(\"nn/examples/mnist_lenet.dml\") as mnist_lenet\n",
     "\n",
-    " # Separate images from lables and scale images to [-1,1]\n",
-    " X_test = data[,2:ncol(data)]\n",
-    " X_test = (X_test / 255.0) * 2 - 1\n",
+    " # Scale images to [-1,1]\n",
+    " X_test = (X_test / 255) * 2 - 1\n",
     "\n",
     " # Predict\n",
-    " probs = mnist_lenet::predict(X_test, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)\n",
-    " predictions = rowIndexMax(probs) - 1\n",
+    " y_prob = mnist_lenet::predict(X_test, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)\n",
+    " y_pred = rowIndexMax(y_prob) - 1\n",
     "\"\"\"\n",
-    "script = (dml(scriptPredict).input(data=testData, C=1, Hin=28, Win=28, W1=W1, b1=b1, W2=W2, b2=b2, W3=W3, b3=b3, W4=W4, b4=b4)\n",
-    " .output(\"predictions\"))\n",
+    "prog = (dml(script_predict).input(X_test=X_test, C=1, Hin=28, Win=28, W1=W1, b1=b1,\n",
+    " W2=W2, b2=b2, W3=W3, b3=b3, W4=W4, b4=b4)\n",
+    " .output(\"y_pred\"))\n",
     "\n",
-    "predictions = ml.execute(script).get(\"predictions\").toNumPy()\n",
-    "\n",
-    "print (classification_report(testData[:,0], predictions))"
+    "y_pred = ml.execute(prog).get(\"y_pred\").toNumPy()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "print(metrics.accuracy_score(y_test, y_pred))\n",
+    "print(metrics.classification_report(y_test, y_pred))"
    ]
   },
   {
@@ -316,14 +223,14 @@
    "metadata": {},
    "source": [
     "<a id=\"predict\"></a>\n",
-    "## Detect handwritten Digits"
+    "## Detect handwritten digits"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Define a function that randomly selects a test image, display the image, and scores it."
+    "Define a function that randomly selects a test image, displays the image, and scores it."
    ]
   },
   {
@@ -334,11 +241,11 @@
    },
    "outputs": [],
    "source": [
-    "img_size = int(np.sqrt(testData.shape[1] - 1))\n",
+    "img_size = int(np.sqrt(X_test.shape[1]))\n",
     "\n",
     "def displayImage(i):\n",
-    " image = (testData[i,1:]).reshape((img_size, img_size)).astype(\"uint8\")\n",
-    " imgplot = plt.imshow(image, cmap='gray') "
+    " image = (X_test[i]).reshape(img_size, img_size).astype(np.uint8)\n",
+    " imgplot = plt.imshow(image, cmap='gray') "
    ]
   },
   {
@@ -350,28 +257,27 @@
    "outputs": [],
    "source": [
     "def predictImage(i):\n",
-    " image = testData[i,:].reshape(1,testData.shape[1])\n",
-    " prog = dml(scriptPredict).input(data=image, C=1, Hin=28, Win=28, W1=W1, b1=b1, W2=W2, b2=b2, W3=W3, b3=b3, W4=W4, b4=b4) \\\n",
-    " .output(\"predictions\")\n",
-    " result = ml.execute(prog)\n",
-    " return (result.get(\"predictions\").toNumPy())[0]"
+    " image = X_test[i].reshape(1, -1)\n",
+    " out = \"y_pred\"\n",
+    " prog = (dml(script_predict).input(X_test=image, C=1, Hin=28, Win=28, W1=W1, b1=b1,\n",
+    " W2=W2, b2=b2, W3=W3, b3=b3, W4=W4, b4=b4)\n",
+    " .output(out))\n",
+    " pred = int(ml.execute(prog).get(out).toNumPy())\n",
+    " return pred"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
+   "metadata": {},
    "outputs": [],
    "source": [
-    "i = np.random.choice(np.arange(0, len(testData)), size = (1,))\n",
-    "\n",
+    "i = np.random.randint(len(X_test))\n",
     "p = predictImage(i)\n",
     "\n",
-    "print (\"Image \" + str(i) + \"\\nPredicted digit: \" + str(p) + \"\\nActual digit: \" + str(testData[i,0]) + \"\\nResult: \" + str(p == testData[i,0]))\n",
+    "print(\"Image {}\\nPredicted digit: {}\\nActual digit: {}\\nResult: {}\".format(\n",
+    " i, p, int(y_test[i]), p == int(y_test[i])))\n",
     "\n",
-    "p\n",
     "displayImage(i)"
    ]
   },
@@ -382,27 +288,27 @@
    "outputs": [],
    "source": [
     "pd.set_option('display.max_columns', 28)\n",
-    "pd.DataFrame((testData[i,1:]).reshape(img_size, img_size),dtype='uint')"
+    "pd.DataFrame((X_test[i]).reshape(img_size, img_size), dtype='uint')"
    ]
   }
  ],
 "metadata": {
  "kernelspec": {
-   "display_name": "Python 2",
   "language": "python",
-   "name": "python2"
+   "display_name": "Python 3 + Spark 2.x + SystemML",
   "language": "python",
+   "name": "pyspark3_2.x"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
-    "version": 2
+    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython2",
-   "version": "2.7.13"
+   "pygments_lexer": "ipython3",
+   "version": "3.6.1"
  }
 },
 "nbformat": 4,
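----------------------------------------------------------------------

Note: for anyone who wants to try the updated workflow outside of Jupyter, the cells in this diff condense into a short standalone script. The sketch below is assembled from the notebook cells above and is not part of the commit itself; it assumes an existing SparkContext `sc` (e.g. a `pyspark` session with SystemML on the classpath) and that the `nn/examples/mnist_lenet.dml` script is reachable from the working directory. The two DML lines that slice out `X` and `X_val` fall outside the hunks shown above and are inferred from the surrounding code.

import numpy as np
from sklearn import datasets, metrics
from sklearn.cross_validation import train_test_split  # sklearn >= 0.18: sklearn.model_selection
from systemml import MLContext, dml

ml = MLContext(sc)  # assumes a live SparkContext `sc`

# Download MNIST (70,000 examples) and hold out 10,000 examples for testing.
mnist = datasets.fetch_mldata("MNIST Original")
X_train, X_test, y_train, y_test = train_test_split(
    mnist.data, mnist.target.astype(np.uint8).reshape(-1, 1), test_size=10000)

train_script = """
  source("nn/examples/mnist_lenet.dml") as mnist_lenet

  # Scale images to [-1,1], and one-hot encode the labels
  images = (images / 255) * 2 - 1
  n = nrow(images)
  labels = table(seq(1, n), labels+1, n, 10)

  # Split into training (55,000 examples) and validation (5,000 examples)
  X = images[5001:nrow(images),]  # inferred: not shown in the hunks above
  X_val = images[1:5000,]         # inferred: not shown in the hunks above
  y = labels[5001:nrow(images),]
  y_val = labels[1:5000,]

  # Train the model to produce weights & biases
  [W1, b1, W2, b2, W3, b3, W4, b4] = mnist_lenet::train(X, y, X_val, y_val, C, Hin, Win, epochs)
"""
out = ('W1', 'b1', 'W2', 'b2', 'W3', 'b3', 'W4', 'b4')
prog = (dml(train_script).input(images=X_train, labels=y_train, epochs=1, C=1, Hin=28, Win=28)
                         .output(*out))
W1, b1, W2, b2, W3, b3, W4, b4 = ml.execute(prog).get(*out)

predict_script = """
  source("nn/examples/mnist_lenet.dml") as mnist_lenet

  # Scale images to [-1,1]
  X_test = (X_test / 255) * 2 - 1

  # Predict
  y_prob = mnist_lenet::predict(X_test, C, Hin, Win, W1, b1, W2, b2, W3, b3, W4, b4)
  y_pred = rowIndexMax(y_prob) - 1
"""
prog = (dml(predict_script).input(X_test=X_test, C=1, Hin=28, Win=28, W1=W1, b1=b1,
                                  W2=W2, b2=b2, W3=W3, b3=b3, W4=W4, b4=b4)
                           .output("y_pred"))
y_pred = ml.execute(prog).get("y_pred").toNumPy()

print(metrics.accuracy_score(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))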

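One small follow-up note on the imports cell: the updated notebook keeps the deprecated `sklearn.cross_validation` import and only comments out the `sklearn.model_selection` alternative. A version-tolerant variant (a sketch, not part of this commit) could pick the module at import time, since `cross_validation` was deprecated in scikit-learn 0.18 and removed in 0.20:

try:
    from sklearn.model_selection import train_test_split  # scikit-learn >= 0.18
except ImportError:
    from sklearn.cross_validation import train_test_split  # older scikit-learn releases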