Repository: incubator-systemml Updated Branches: refs/heads/master 3ef044092 -> 6bce4dfda
[SYSTEMML-594] Adding Flight Delay Demo notebook Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/6bce4dfd Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/6bce4dfd Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/6bce4dfd Branch: refs/heads/master Commit: 6bce4dfdae69a256f84b439029c3fd89275e5601 Parents: 3ef0440 Author: Niketan Pansare <[email protected]> Authored: Tue Apr 26 15:49:09 2016 -0700 Committer: Niketan Pansare <[email protected]> Committed: Tue Apr 26 15:49:09 2016 -0700 ---------------------------------------------------------------------- samples/jupyter-notebooks/Flight_Delay_Demo.ipynb | 1 + 1 file changed, 1 insertion(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/6bce4dfd/samples/jupyter-notebooks/Flight_Delay_Demo.ipynb ---------------------------------------------------------------------- diff --git a/samples/jupyter-notebooks/Flight_Delay_Demo.ipynb b/samples/jupyter-notebooks/Flight_Delay_Demo.ipynb new file mode 100644 index 0000000..d80b1dc --- /dev/null +++ b/samples/jupyter-notebooks/Flight_Delay_Demo.ipynb @@ -0,0 +1 @@ +{"nbformat_minor": 0, "cells": [{"source": "# Flight Delay Prediction Demo Using SystemML", "cell_type": "markdown", "metadata": {}}, {"source": "This notebook is based on datascientistworkbench.com's tutorial notebook for predicting flight delay.", "cell_type": "markdown", "metadata": {}}, {"source": "## Loading SystemML ", "cell_type": "markdown", "metadata": {}}, {"source": "To use one of the released version, use \"%AddDeps org.apache.systemml systemml 0.9.0-incubating\". To use nightly build, \"%AddJar https://sparktc.ibmcloud.com/repo/latest/SystemML.jar\"\n\nOr you provide SystemML.jar and dependency through commandline when starting the notebook (for example: --packages com.databricks:spark-csv_2.10:1.4.0 --jars SystemML.jar)", "cell_type": "markdown", "metadata": {}}, {"execution_count": 1, "cell_type": "code", "source": "%AddJar https://sparktc.ibmcloud.com/repo/latest/SystemML.jar", "outputs": [{"output_type": "stream", "name": "stdout", "text": "Using cached version of S ystemML.jar\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "Use Spark's CSV package for loading the CSV file", "cell_type": "markdown", "metadata": {}}, {"execution_count": 2, "cell_type": "code", "source": "%AddDeps com.databricks spark-csv_2.10 1.4.0", "outputs": [{"output_type": "stream", "name": "stdout", "text": ":: loading settings :: url = jar:file:/usr/local/spark-kernel/lib/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n:: resolving dependencies :: com.ibm.spark#spark-kernel;working [not transitive]\n\tconfs: [default]\n\tfound com.databricks#spark-csv_2.10;1.4.0 in central\n:: resolution report :: resolve 93ms :: artifacts dl 5ms\n\t:: modules in use:\n\tcom.databricks#spark-csv_2.10;1.4.0 from central in [default]\n\t---------------------------------------------------------------------\n\t| | modules || artifacts |\n\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n\t---- -----------------------------------------------------------------\n\t| default | 1 | 0 | 0 | 0 || 1 | 0 |\n\t---------------------------------------------------------------------\n:: retrieving :: com.ibm.spark#spark-kernel\n\tconfs: [default]\n\t0 artifacts copied, 1 already retrieved (0kB/6ms)\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Import Data", "cell_type": "markdown", "metadata": {"collapsed": true}}, {"source": "Download the airline dataset from stat-computing.org if not already downloaded", "cell_type": "markdown", "metadata": {}}, {"execution_count": 3, "cell_type": "code", "source": "import sys.process._\nimport java.net.URL\nimport java.io.File\nval url = \"http://stat-computing.org/dataexpo/2009/2007.csv.bz2\"\nval localFilePath = \"airline2007.csv.bz2\"\nif(!new java.io.File(localFilePath).exists) {\n new URL(url) #> new File(localFilePath) !!\n}", "outputs": [], "metadata": {"collapsed": false, "truste d": true}}, {"source": "Load the dataset into DataFrame using Spark CSV package", "cell_type": "markdown", "metadata": {}}, {"execution_count": 4, "cell_type": "code", "source": "import org.apache.spark.sql.SQLContext\nval sqlContext = new SQLContext(sc)\nval fmt = sqlContext.read.format(\"com.databricks.spark.csv\")\nval opt = fmt.options(Map(\"header\"->\"true\", \"inferSchema\"->\"true\", \"nullValue\"->\"null\", \"treatEmptyValuesAsNulls\"->\"true\"))\nval airline = opt.load(localFilePath)", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 5, "cell_type": "code", "source": "airline.printSchema", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Data Exploration\nWhich airports have the most delays?", "cell_type": "markdown", "metadata": {}}, {"execution_count": 6, "cell_type": "code", "source": "airline.registerTempTable(\"airline\")\nsqlContext.sql(\"\"\"SELECT Origin, count(*) conFlight, avg(DepDelay) delay\ n FROM airline \n GROUP BY Origin\n ORDER BY delay DESC\"\"\").show", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "## Modeling: Logistic Regression\n\nPredict departure delays of flights from JFK", "cell_type": "markdown", "metadata": {}}, {"execution_count": 7, "cell_type": "code", "source": "val smallAirlineData = sqlContext.sql(\"SELECT * FROM airline WHERE Origin='JFK'\").na.fill(0.0, Seq(\"DepDelay\"))\nval datasets = smallAirlineData.withColumnRenamed(\"DepDelay\", \"label\").randomSplit(Array(0.7, 0.3))\nval trainDataset = datasets(0).cache\nval testDataset = datasets(1).cache", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 8, "cell_type": "code", "source": "trainDataset.count", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": null, "cell_type": "code", "source": "testDataset.count", "outputs": [], "metadata" : {"collapsed": false, "trusted": true}}, {"source": "### Feature selection", "cell_type": "markdown", "metadata": {}}, {"source": "Encode the destination using one-hot encoding and include the columns Year, Month, DayofMonth, DayOfWeek, Distance", "cell_type": "markdown", "metadata": {}}, {"execution_count": 14, "cell_type": "code", "source": "import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}\n\nval indexer = new StringIndexer().setInputCol(\"Dest\").setOutputCol(\"DestIndex\")\nval encoder = new OneHotEncoder().setInputCol(\"DestIndex\").setOutputCol(\"DestVec\")\nval assembler = new VectorAssembler().setInputCols(Array(\"Year\",\"Month\",\"DayofMonth\",\"DayOfWeek\",\"Distance\",\"DestVec\")).setOutputCol(\"features\")", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "### Build the model: Use SystemML's MLPipeline wrapper. \n\nThis wrapper invokes MultiLogReg.dml (for training) and GLM-predict.dml (for prediction). Th ese DML algorithms are available at https://github.com/apache/incubator-systemml/tree/master/scripts/algorithms", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "import org.apache.spark.ml.Pipeline\nimport org.apache.sysml.api.ml.LogisticRegression\n\nval lr = new LogisticRegression(\"log\", sc).setRegParam(1e-4).setTol(1e-2).setMaxInnerIter(5).setMaxOuterIter(5)\n\nval pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))\nval model = pipeline.fit(trainDataset)", "outputs": [{"output_type": "stream", "name": "stdout", "text": "BEGIN MULTINOMIAL LOGISTIC REGRESSION SCRIPT\nReading X...\nReading Y...\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"source": "### Evaluate the model \n\nOutput RMS error on test data", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "val predictions = model.transform(testDataset.withColumnRenamed(\"label\", \"OriginalLa bel\"))\npredictions.registerTempTable(\"predictions\")\nsqlContext.sql(\"SELECT sqrt(avg(pow(OriginalLabel - label, 2.0))) FROM predictions\").show", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": "### Perform k-fold cross-validation to tune the hyperparameters\n\nPerform cross-validation to tune the regularization parameter for Logistic regression.", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator\nimport org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}\n\nval crossval = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)\nval paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 1e-3, 1e-6)).build()\ncrossval.setEstimatorParamMaps(paramGrid)\ncrossval.setNumFolds(2) // Setting k = 2\nval cvmodel = crossval.fit(trainDataset)", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "### Evaluate the cross-validated model", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, "cell_type": "code", "source": "val cvpredictions = cvmodel.transform(testDataset.withColumnRenamed(\"label\", \"OriginalLabel\"))\ncvpredictions.registerTempTable(\"cvpredictions\")\nsqlContext.sql(\"SELECT sqrt(avg(pow(OriginalLabel - label, 2.0))) FROM cvpredictions\").show", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": "## Homework ;)\n\nRead http://apache.github.io/incubator-systemml/algorithms-classification.html#multinomial-logistic-regression and perform cross validation on other hyperparameters: for example: icpt, tol, maxOuterIter, maxInnerIter", "cell_type": "markdown", "metadata": {}}], "nbformat": 4, "metadata": {"kernelspec": {"display_name": "Scala 2.10.4 (Spark 1.5.2)", "name": "spark", "language": "scala"}, "language_info": {"name": "scala"}}} \ No newline at end of file
