incubator-systemml git commit: [SYSTEMML-594] Adding Flight Delay Demo notebook

niketanpansare Tue, 26 Apr 2016 15:50:10 -0700

Repository: incubator-systemml
Updated Branches:
  refs/heads/master 3ef044092 -> 6bce4dfda



[SYSTEMML-594] Adding Flight Delay Demo notebook

Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/6bce4dfd
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/6bce4dfd
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/6bce4dfd

Branch: refs/heads/master
Commit: 6bce4dfdae69a256f84b439029c3fd89275e5601
Parents: 3ef0440
Author: Niketan Pansare <[email protected]>
Authored: Tue Apr 26 15:49:09 2016 -0700
Committer: Niketan Pansare <[email protected]>
Committed: Tue Apr 26 15:49:09 2016 -0700

----------------------------------------------------------------------
 samples/jupyter-notebooks/Flight_Delay_Demo.ipynb | 1 +
 1 file changed, 1 insertion(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/6bce4dfd/samples/jupyter-notebooks/Flight_Delay_Demo.ipynb
----------------------------------------------------------------------
diff --git a/samples/jupyter-notebooks/Flight_Delay_Demo.ipynb b/samples/jupyter-notebooks/Flight_Delay_Demo.ipynb
new file mode 100644
index 0000000..d80b1dc
--- /dev/null
+++ b/samples/jupyter-notebooks/Flight_Delay_Demo.ipynb
@@ -0,0 +1 @@
+{"nbformat_minor": 0, "cells": [{"source": "# Flight Delay Prediction Demo 
Using SystemML", "cell_type": "markdown", "metadata": {}}, {"source": "This 
notebook is based on datascientistworkbench.com's tutorial notebook for 
predicting flight delay.", "cell_type": "markdown", "metadata": {}}, {"source": 
"## Loading SystemML ", "cell_type": "markdown", "metadata": {}}, {"source": 
"To use one of the released version, use \"%AddDeps org.apache.systemml 
systemml 0.9.0-incubating\". To use nightly build, \"%AddJar 
https://sparktc.ibmcloud.com/repo/latest/SystemML.jar\"\n\nOr you provide 
SystemML.jar and dependency through commandline when starting the notebook (for 
example: --packages com.databricks:spark-csv_2.10:1.4.0 --jars SystemML.jar)", 
"cell_type": "markdown", "metadata": {}}, {"execution_count": 1, "cell_type": 
"code", "source": "%AddJar 
https://sparktc.ibmcloud.com/repo/latest/SystemML.jar", "outputs": 
[{"output_type": "stream", "name": "stdout", "text": "Using cached version of 
SystemML.jar\n"}], "metadata": {"collapsed": false, "trusted": true}}, 
{"source": "Use Spark's CSV package for loading the CSV file", "cell_type": 
"markdown", "metadata": {}}, {"execution_count": 2, "cell_type": "code", 
"source": "%AddDeps com.databricks spark-csv_2.10 1.4.0", "outputs": 
[{"output_type": "stream", "name": "stdout", "text": ":: loading settings :: 
url = 
jar:file:/usr/local/spark-kernel/lib/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml\n::
 resolving dependencies :: com.ibm.spark#spark-kernel;working [not 
transitive]\n\tconfs: [default]\n\tfound com.databricks#spark-csv_2.10;1.4.0 in 
central\n:: resolution report :: resolve 93ms :: artifacts dl 5ms\n\t:: modules 
in use:\n\tcom.databricks#spark-csv_2.10;1.4.0 from central in 
[default]\n\t---------------------------------------------------------------------\n\t|
                  |            modules            ||   artifacts   |\n\t|       
conf       | number| search|dwnlded|evicted|| number|dwnlded|\n\t---------------------------------------------------------------------\n\t|      
default     |   1   |   0   |   0   |   0   ||   1   |   0   
|\n\t---------------------------------------------------------------------\n:: 
retrieving :: com.ibm.spark#spark-kernel\n\tconfs: [default]\n\t0 artifacts 
copied, 1 already retrieved (0kB/6ms)\n"}], "metadata": {"collapsed": false, 
"trusted": true}}, {"source": "## Import Data", "cell_type": "markdown", 
"metadata": {"collapsed": true}}, {"source": "Download the airline dataset from 
stat-computing.org if not already downloaded", "cell_type": "markdown", 
"metadata": {}}, {"execution_count": 3, "cell_type": "code", "source": "import 
sys.process._\nimport java.net.URL\nimport java.io.File\nval url = 
\"http://stat-computing.org/dataexpo/2009/2007.csv.bz2\"\nval localFilePath = 
\"airline2007.csv.bz2\"\nif(!new java.io.File(localFilePath).exists) {\n    new 
URL(url) #> new File(localFilePath) !!\n}", "outputs": [], "metadata": 
{"collapsed": false, "trusted": true}}, {"source": "Load the dataset into DataFrame using Spark CSV 
package", "cell_type": "markdown", "metadata": {}}, {"execution_count": 4, 
"cell_type": "code", "source": "import org.apache.spark.sql.SQLContext\nval 
sqlContext = new SQLContext(sc)\nval fmt = 
sqlContext.read.format(\"com.databricks.spark.csv\")\nval opt = 
fmt.options(Map(\"header\"->\"true\", \"inferSchema\"->\"true\", 
\"nullValue\"->\"null\", \"treatEmptyValuesAsNulls\"->\"true\"))\nval airline = 
opt.load(localFilePath)", "outputs": [], "metadata": {"collapsed": false, 
"trusted": true}}, {"execution_count": 5, "cell_type": "code", "source": 
"airline.printSchema", "outputs": [], "metadata": {"collapsed": false, 
"trusted": true}}, {"source": "## Data Exploration\nWhich airports have the 
most delays?", "cell_type": "markdown", "metadata": {}}, {"execution_count": 6, 
"cell_type": "code", "source": 
"airline.registerTempTable(\"airline\")\nsqlContext.sql(\"\"\"SELECT Origin, 
count(*) conFlight, avg(DepDelay) delay\n                    FROM airline \n                    GROUP BY Origin\n      
              ORDER BY delay DESC\"\"\").show", "outputs": [], "metadata": 
{"collapsed": false, "trusted": true}}, {"source": "## Modeling: Logistic 
Regression\n\nPredict departure delays of flights from JFK", "cell_type": 
"markdown", "metadata": {}}, {"execution_count": 7, "cell_type": "code", 
"source": "val smallAirlineData = sqlContext.sql(\"SELECT * FROM airline WHERE 
Origin='JFK'\").na.fill(0.0, Seq(\"DepDelay\"))\nval datasets = 
smallAirlineData.withColumnRenamed(\"DepDelay\", 
\"label\").randomSplit(Array(0.7, 0.3))\nval trainDataset = 
datasets(0).cache\nval testDataset = datasets(1).cache", "outputs": [], 
"metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 8, 
"cell_type": "code", "source": "trainDataset.count", "outputs": [], "metadata": 
{"collapsed": false, "trusted": true}}, {"execution_count": null, "cell_type": 
"code", "source": "testDataset.count", "outputs": [], "metadata": {"collapsed": false, 
"trusted": true}}, {"source": "### Feature selection", 
"cell_type": "markdown", "metadata": {}}, {"source": "Encode the destination 
using one-hot encoding and include the columns Year, Month, DayofMonth, 
DayOfWeek, Distance", "cell_type": "markdown", "metadata": {}}, 
{"execution_count": 14, "cell_type": "code", "source": "import 
org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, 
VectorAssembler}\n\nval indexer = new 
StringIndexer().setInputCol(\"Dest\").setOutputCol(\"DestIndex\")\nval encoder 
= new OneHotEncoder().setInputCol(\"DestIndex\").setOutputCol(\"DestVec\")\nval 
assembler = new 
VectorAssembler().setInputCols(Array(\"Year\",\"Month\",\"DayofMonth\",\"DayOfWeek\",\"Distance\",\"DestVec\")).setOutputCol(\"features\")",
 "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": 
"### Build the model: Use SystemML's MLPipeline wrapper. \n\nThis wrapper 
invokes MultiLogReg.dml (for training) and GLM-predict.dml (for prediction). 
These DML algorithms are available at 
https://github.com/apache/incubator-systemml/tree/master/scripts/algorithms", 
"cell_type": "markdown", "metadata": {}}, {"execution_count": null, 
"cell_type": "code", "source": "import org.apache.spark.ml.Pipeline\nimport 
org.apache.sysml.api.ml.LogisticRegression\n\nval lr = new 
LogisticRegression(\"log\", 
sc).setRegParam(1e-4).setTol(1e-2).setMaxInnerIter(5).setMaxOuterIter(5)\n\nval 
pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, 
lr))\nval model = pipeline.fit(trainDataset)", "outputs": [{"output_type": 
"stream", "name": "stdout", "text": "BEGIN MULTINOMIAL LOGISTIC REGRESSION 
SCRIPT\nReading X...\nReading Y...\n"}], "metadata": {"collapsed": false, 
"trusted": true}}, {"source": "### Evaluate the model \n\nOutput RMS error on 
test data", "cell_type": "markdown", "metadata": {}}, {"execution_count": null, 
"cell_type": "code", "source": "val predictions = 
model.transform(testDataset.withColumnRenamed(\"label\", \"OriginalLabel\"))\npredictions.registerTempTable(\"predictions\")\nsqlContext.sql(\"SELECT
 sqrt(avg(pow(OriginalLabel - label, 2.0))) FROM predictions\").show", 
"outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"source": 
"### Perform k-fold cross-validation to tune the hyperparameters\n\nPerform 
cross-validation to tune the regularization parameter for Logistic 
regression.", "cell_type": "markdown", "metadata": {}}, {"execution_count": 
null, "cell_type": "code", "source": "import 
org.apache.spark.ml.evaluation.BinaryClassificationEvaluator\nimport 
org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}\n\nval crossval = 
new CrossValidator().setEstimator(pipeline).setEvaluator(new 
BinaryClassificationEvaluator)\nval paramGrid = new 
ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 1e-3, 
1e-6)).build()\ncrossval.setEstimatorParamMaps(paramGrid)\ncrossval.setNumFolds(2)
 // Setting k = 2\nval cvmodel = crossval.fit(trainDataset)", "outputs": [], 
"metadata": {"collapsed": true, "trusted": true}}, {"source": "### Evaluate the cross-validated model", 
"cell_type": "markdown", "metadata": {}}, {"execution_count": null, 
"cell_type": "code", "source": "val cvpredictions = 
cvmodel.transform(testDataset.withColumnRenamed(\"label\", 
\"OriginalLabel\"))\ncvpredictions.registerTempTable(\"cvpredictions\")\nsqlContext.sql(\"SELECT
 sqrt(avg(pow(OriginalLabel - label, 2.0))) FROM cvpredictions\").show", 
"outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"source": 
"## Homework ;)\n\nRead 
http://apache.github.io/incubator-systemml/algorithms-classification.html#multinomial-logistic-regression
 and perform cross validation on other hyperparameters: for example: icpt, tol, 
maxOuterIter, maxInnerIter", "cell_type": "markdown", "metadata": {}}], 
"nbformat": 4, "metadata": {"kernelspec": {"display_name": "Scala 2.10.4 (Spark 
1.5.2)", "name": "spark", "language": "scala"}, "language_info": {"name": 
"scala"}}}
\ No newline at end of file
