incubator-zeppelin git commit: ZEPPELIN-55 Make tutorial notebook independent from filesystem.

moon Sun, 05 Jul 2015 10:48:06 -0700

Repository: incubator-zeppelin
Updated Branches:
  refs/heads/master 12e5abf28 -> 4fa701928



ZEPPELIN-55 Make tutorial notebook independent from filesystem.

The tutorial notebook downloads the data with `wget`, unzips it, and loads the 
CSV file from the local filesystem.
This works only in local mode and will not work with cluster deployments.

The solutions discussed in issue ZEPPELIN-55 are:

 * Upload data to HDFS
 * Upload data to S3

However, not all users will install HDFS, and accessing S3 via the HDFS client 
needs accessKey and secretKey in the configuration.

This PR makes the tutorial notebook independent of any filesystem by reading 
the data from an http(s) address and parallelizing it directly.

Here's how this PR loads the data:
```
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// load bank data
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
bank.registerTempTable("bank")
```
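
The parsing step above can be tried without a Spark cluster. The following is a minimal sketch, not part of the commit: it applies the same split, header filter, and quote stripping to an in-memory string. The object name `BankParseSketch`, the helper `unquote`, and the sample rows are assumptions made for illustration only.

```scala
// Same shape as the Bank case class in the notebook.
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

object BankParseSketch {
  // Illustrative sample in the assumed layout of bank.csv:
  // ';'-separated fields, string values wrapped in double quotes.
  val sample: String =
    "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"\n" +
    "30;\"unemployed\";\"married\";\"primary\";\"no\";1787\n" +
    "33;\"services\";\"married\";\"secondary\";\"no\";4789"

  // Hypothetical helper: strip the double quotes around a field.
  def unquote(s: String): String = s.replaceAll("\"", "")

  def parse(text: String): Seq[Bank] =
    text.split("\n").toSeq
      .map(_.split(";"))
      .filter(fields => fields(0) != "\"age\"") // drop the header row
      .map(f => Bank(
        f(0).toInt,
        unquote(f(1)),
        unquote(f(2)),
        unquote(f(3)),
        unquote(f(5)).toInt)) // index 4 ("default") is skipped, as in the notebook

  def main(args: Array[String]): Unit =
    parse(sample).foreach(println)
}
```

In the notebook the same transformations run on the RDD returned by `sc.parallelize`, so the whole file is first read into the driver's memory. That is acceptable for a small tutorial dataset but would not suit large inputs.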

Author: Lee moon soo <[email protected]>

Closes #140 from Leemoonsoo/ZEPPELIN-55 and squashes the following commits:

653b1bc [Lee moon soo] Load data directly from http without using filesystem


Project: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/commit/4fa70192
Tree: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/tree/4fa70192
Diff: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/diff/4fa70192

Branch: refs/heads/master
Commit: 4fa7019286a7f809a94c89c0cc71e767364ac1f3
Parents: 12e5abf
Author: Lee moon soo <[email protected]>
Authored: Fri Jul 3 13:46:35 2015 -0700
Committer: Lee moon soo <[email protected]>
Committed: Sun Jul 5 10:47:01 2015 -0700

----------------------------------------------------------------------
 notebook/2A94M5J1Z/note.json | 101 +++++++++++++++++++-------------------
 1 file changed, 51 insertions(+), 50 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/4fa70192/notebook/2A94M5J1Z/note.json
----------------------------------------------------------------------
diff --git a/notebook/2A94M5J1Z/note.json b/notebook/2A94M5J1Z/note.json
index a37cf19..785ccea 100644
--- a/notebook/2A94M5J1Z/note.json
+++ b/notebook/2A94M5J1Z/note.json
@@ -13,7 +13,7 @@
           "groups": [],
           "scatter": {}
         },
-        "editorHide": false
+        "editorHide": true
       },
       "settings": {
         "params": {},
@@ -33,41 +33,8 @@
       "progressUpdateIntervalMs": 500
     },
     {
-      "title": "Prepare data",
-      "text": "import sys.process._\n//you will need \u0027wget\u0027 tool to 
download\n\"wget 
http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip\" 
!\n\"mkdir data\" !\n\"unzip bank.zip -d data\" !\n\"rm bank.zip\" !",
-      "config": {
-        "colWidth": 12.0,
-        "graph": {
-          "mode": "table",
-          "height": 300.0,
-          "optionOpen": false,
-          "keys": [],
-          "values": [],
-          "groups": [],
-          "scatter": {}
-        },
-        "title": true
-      },
-      "settings": {
-        "params": {},
-        "forms": {}
-      },
-      "jobName": "paragraph_1417656535623_-196593192",
-      "id": "20141204-102855_1590713432",
-      "result": {
-        "code": "SUCCESS",
-        "type": "TEXT",
-        "msg": "import sys.process._\nwarning: there were 1 feature 
warning(s); re-run with -feature for details\nres1: Int \u003d 0\nwarning: 
there were 1 feature warning(s); re-run with -feature for details\nres2: Int 
\u003d 0\nwarning: there were 1 feature warning(s); re-run with -feature for 
details\nres3: Int \u003d 0\nwarning: there were 1 feature warning(s); re-run 
with -feature for details\nres4: Int \u003d 0\n"
-      },
-      "dateCreated": "Dec 4, 2014 10:28:55 AM",
-      "dateStarted": "Apr 1, 2015 9:11:12 PM",
-      "dateFinished": "Apr 1, 2015 9:11:22 PM",
-      "status": "FINISHED",
-      "progressUpdateIntervalMs": 500
-    },
-    {
       "title": "Load data into table",
-      "text": "import sys.process._\n// Zeppelin creates and injects sc 
(SparkContext) and sqlContext (HiveContext or SqlContext)\n// So you don\u0027t 
need create them manually\n\nval zeppelinHome \u003d (\"pwd\" 
!!).replace(\"\\n\", \"\")\nval bankText \u003d 
sc.textFile(s\"file://$zeppelinHome/data/bank-full.csv\")\n\ncase class 
Bank(age: Integer, job: String, marital: String, education: String, balance: 
Integer)\n\nval bank \u003d bankText.map(s \u003d\u003e 
s.split(\";\")).filter(s \u003d\u003e s(0) !\u003d \"\\\"age\\\"\").map(\n    s 
\u003d\u003e Bank(s(0).toInt, \n            s(1).replaceAll(\"\\\"\", \"\"),\n  
          s(2).replaceAll(\"\\\"\", \"\"),\n            
s(3).replaceAll(\"\\\"\", \"\"),\n            s(5).replaceAll(\"\\\"\", 
\"\").toInt\n        )\n).toDF()\nbank.registerTempTable(\"bank\")\n\n",
+      "text": "import org.apache.commons.io.IOUtils\nimport 
java.net.URL\nimport java.nio.charset.Charset\n\n// Zeppelin creates and 
injects sc (SparkContext) and sqlContext (HiveContext or SqlContext)\n// So you 
don\u0027t need create them manually\n\n// load bank data\nval bankText \u003d 
sc.parallelize(\n    IOUtils.toString(\n        new 
URL(\"https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv\"),\n     
   Charset.forName(\"utf8\")).split(\"\\n\"))\n\ncase class Bank(age: Integer, 
job: String, marital: String, education: String, balance: Integer)\n\nval bank 
\u003d bankText.map(s \u003d\u003e s.split(\";\")).filter(s \u003d\u003e s(0) 
!\u003d \"\\\"age\\\"\").map(\n    s \u003d\u003e Bank(s(0).toInt, \n           
 s(1).replaceAll(\"\\\"\", \"\"),\n            s(2).replaceAll(\"\\\"\", 
\"\"),\n            s(3).replaceAll(\"\\\"\", \"\"),\n            
s(5).replaceAll(\"\\\"\", \"\").toInt\n        
)\n).toDF()\nbank.registerTempTable(\"bank\")",
       "config": {
         "colWidth": 12.0,
         "graph": {
@@ -90,11 +57,11 @@
       "result": {
         "code": "SUCCESS",
         "type": "TEXT",
-        "msg": "import sys.process._\nsqlContext: 
org.apache.spark.sql.SQLContext \u003d 
org.apache.spark.sql.SQLContext@2c91e2d6\nwarning: there were 1 feature 
warning(s); re-run with -feature for details\nzeppelinHome: String \u003d 
/home/langley/lab/incubator-zeppelin\nbankText: 
org.apache.spark.rdd.RDD[String] \u003d 
/home/langley/lab/incubator-zeppelin/data/bank-full.csv MapPartitionsRDD[1] at 
textFile at \u003cconsole\u003e:31\ndefined class Bank\nbank: 
org.apache.spark.sql.DataFrame \u003d [age: int, job: string, marital: string, 
education: string, balance: int]\n"
+        "msg": "import org.apache.commons.io.IOUtils\nimport 
java.net.URL\nimport java.nio.charset.Charset\nbankText: 
org.apache.spark.rdd.RDD[String] \u003d ParallelCollectionRDD[32] at 
parallelize at \u003cconsole\u003e:65\ndefined class Bank\nbank: 
org.apache.spark.sql.DataFrame \u003d [age: int, job: string, marital: string, 
education: string, balance: int]\n"
       },
       "dateCreated": "Feb 10, 2015 1:52:59 AM",
-      "dateStarted": "Apr 1, 2015 9:11:28 PM",
-      "dateFinished": "Apr 1, 2015 9:11:39 PM",
+      "dateStarted": "Jul 3, 2015 1:43:40 PM",
+      "dateFinished": "Jul 3, 2015 1:43:45 PM",
       "status": "FINISHED",
       "progressUpdateIntervalMs": 500
     },
@@ -144,11 +111,11 @@
       "result": {
         "code": "SUCCESS",
         "type": "TABLE",
-        "msg": 
"age\tvalue\n18\t12\n19\t35\n20\t50\n21\t79\n22\t129\n23\t202\n24\t302\n25\t527\n26\t805\n27\t909\n28\t1038\n29\t1185\n"
+        "msg": 
"age\tvalue\n19\t4\n20\t3\n21\t7\n22\t9\n23\t20\n24\t24\n25\t44\n26\t77\n27\t94\n28\t103\n29\t97\n"
       },
       "dateCreated": "Feb 10, 2015 1:53:02 AM",
-      "dateStarted": "Apr 1, 2015 9:11:43 PM",
-      "dateFinished": "Apr 1, 2015 9:11:45 PM",
+      "dateStarted": "Jul 3, 2015 1:43:17 PM",
+      "dateFinished": "Jul 3, 2015 1:43:23 PM",
       "status": "FINISHED",
       "progressUpdateIntervalMs": 500
     },
@@ -206,11 +173,11 @@
       "result": {
         "code": "SUCCESS",
         "type": "TABLE",
-        "msg": 
"age\tvalue\n18\t12\n19\t35\n20\t50\n21\t79\n22\t129\n23\t202\n24\t302\n25\t527\n26\t805\n27\t909\n28\t1038\n29\t1185\n30\t1757\n31\t1996\n32\t2085\n33\t1972\n34\t1930\n"
+        "msg": 
"age\tvalue\n19\t4\n20\t3\n21\t7\n22\t9\n23\t20\n24\t24\n25\t44\n26\t77\n27\t94\n28\t103\n29\t97\n30\t150\n31\t199\n32\t224\n33\t186\n34\t231\n"
       },
       "dateCreated": "Feb 12, 2015 2:54:04 PM",
-      "dateStarted": "Apr 1, 2015 9:12:03 PM",
-      "dateFinished": "Apr 1, 2015 9:12:03 PM",
+      "dateStarted": "Jul 3, 2015 1:43:28 PM",
+      "dateFinished": "Jul 3, 2015 1:43:29 PM",
       "status": "FINISHED",
       "progressUpdateIntervalMs": 500
     },
@@ -279,11 +246,11 @@
       "result": {
         "code": "SUCCESS",
         "type": "TABLE",
-        "msg": 
"age\tvalue\n18\t12\n19\t35\n20\t47\n21\t74\n22\t120\n23\t175\n24\t248\n25\t423\n26\t615\n27\t658\n28\t697\n29\t683\n30\t1012\n31\t1017\n32\t941\n33\t746\n34\t650\n35\t631\n36\t538\n37\t453\n38\t394\n39\t346\n40\t257\n41\t241\n42\t218\n43\t183\n44\t170\n45\t146\n46\t130\n47\t100\n48\t124\n49\t101\n50\t76\n51\t72\n52\t62\n53\t71\n54\t55\n55\t54\n56\t45\n57\t38\n58\t35\n59\t36\n60\t27\n61\t5\n63\t2\n66\t5\n67\t3\n68\t4\n69\t2\n70\t1\n71\t1\n72\t5\n73\t2\n77\t1\n83\t2\n86\t1\n"
+        "msg": 
"age\tvalue\n19\t4\n20\t3\n21\t7\n22\t9\n23\t17\n24\t13\n25\t33\n26\t56\n27\t64\n28\t78\n29\t56\n30\t92\n31\t86\n32\t105\n33\t61\n34\t75\n35\t46\n36\t50\n37\t43\n38\t44\n39\t30\n40\t25\n41\t19\n42\t23\n43\t21\n44\t20\n45\t15\n46\t14\n47\t12\n48\t12\n49\t11\n50\t8\n51\t6\n52\t9\n53\t4\n55\t3\n56\t3\n57\t2\n58\t7\n59\t2\n60\t5\n66\t2\n69\t1\n"
       },
       "dateCreated": "Feb 13, 2015 11:04:22 PM",
-      "dateStarted": "Apr 1, 2015 9:12:10 PM",
-      "dateFinished": "Apr 1, 2015 9:12:10 PM",
+      "dateStarted": "Jul 3, 2015 1:43:33 PM",
+      "dateFinished": "Jul 3, 2015 1:43:34 PM",
       "status": "FINISHED",
       "progressUpdateIntervalMs": 500
     },
@@ -299,7 +266,8 @@
           "values": [],
           "groups": [],
           "scatter": {}
-        }
+        },
+        "editorHide": true
       },
       "settings": {
         "params": {},
@@ -319,22 +287,55 @@
       "progressUpdateIntervalMs": 500
     },
     {
-      "config": {},
+      "text": "%md\n\nAbout bank data\n\n```\nCitation Request:\n  This 
dataset is public available for research. The details are described in [Moro et 
al., 2011]. \n  Please include this citation if you plan to use this 
database:\n\n  [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using 
Data Mining for Bank Direct Marketing: An Application of the CRISP-DM 
Methodology. \n  In P. Novais et al. (Eds.), Proceedings of the European 
Simulation and Modelling Conference - ESM\u00272011, pp. 117-121, Guimarães, 
Portugal, October, 2011. EUROSIS.\n\n  Available at: [pdf] 
http://hdl.handle.net/1822/14838\n                [bib] 
http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt\n```",
+      "config": {
+        "colWidth": 12.0,
+        "graph": {
+          "mode": "table",
+          "height": 300.0,
+          "optionOpen": false,
+          "keys": [],
+          "values": [],
+          "groups": [],
+          "scatter": {}
+        },
+        "editorHide": true
+      },
       "settings": {
         "params": {},
         "forms": {}
       },
       "jobName": "paragraph_1427420818407_872443482",
       "id": "20150326-214658_12335843",
+      "result": {
+        "code": "SUCCESS",
+        "type": "HTML",
+        "msg": "\u003cp\u003eAbout bank 
data\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003eCitation Request:\n  This 
dataset is public available for research. The details are described in [Moro et 
al., 2011]. \n  Please include this citation if you plan to use this 
database:\n\n  [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using 
Data Mining for Bank Direct Marketing: An Application of the CRISP-DM 
Methodology. \n  In P. Novais et al. (Eds.), Proceedings of the European 
Simulation and Modelling Conference - ESM\u00272011, pp. 117-121, Guimarães, 
Portugal, October, 2011. EUROSIS.\n\n  Available at: [pdf] 
http://hdl.handle.net/1822/14838\n                [bib] 
http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt\n\u003c/code\u003e\u003c/pre\u003e\n"
+      },
       "dateCreated": "Mar 26, 2015 9:46:58 PM",
+      "dateStarted": "Jul 3, 2015 1:44:56 PM",
+      "dateFinished": "Jul 3, 2015 1:44:56 PM",
+      "status": "FINISHED",
+      "progressUpdateIntervalMs": 500
+    },
+    {
+      "config": {},
+      "settings": {
+        "params": {},
+        "forms": {}
+      },
+      "jobName": "paragraph_1435955447812_-158639899",
+      "id": "20150703-133047_853701097",
+      "dateCreated": "Jul 3, 2015 1:30:47 PM",
       "status": "READY",
       "progressUpdateIntervalMs": 500
     }
   ],
   "name": "Zeppelin Tutorial",
   "id": "2A94M5J1Z",
+  "angularObjects": {},
   "config": {
     "looknfeel": "default"
   },
   "info": {}
-}
+}
\ No newline at end of file
