Repository: incubator-zeppelin
Updated Branches:
refs/heads/master 12e5abf28 -> 4fa701928
ZEPPELIN-55 Make tutorial notebook independent from filesystem.
Tutorial notebook is downloading data using `wget` and unzip and load the csv
file.
This works only in local-mode and not going to work with cluster deployments.
Discussed solution in the issue ZEPPELIN-55 are
* Upload data to HDFS
* Upload data to S3
However, not all user will install HDFS, and accessing S3 via hdfs client needs
accessKey and secretKey in configuration.
this PR make tutorial notebook independent from any filesystem, by reading data
from http(s) address and parallelize directly.
Here's how this PR loads data
```
// load bank data
val bankText = sc.parallelize(
IOUtils.toString(
new
URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
Charset.forName("utf8")).split("\n"))
case class Bank(age: Integer, job: String, marital: String, education: String,
balance: Integer)
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
s => Bank(s(0).toInt,
s(1).replaceAll("\"", ""),
s(2).replaceAll("\"", ""),
s(3).replaceAll("\"", ""),
s(5).replaceAll("\"", "").toInt
)
).toDF()
bank.registerTempTable("bank")
```
Author: Lee moon soo <[email protected]>
Closes #140 from Leemoonsoo/ZEPPELIN-55 and squashes the following commits:
653b1bc [Lee moon soo] Load data directly from http without using filesystem
Project: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/repo
Commit:
http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/commit/4fa70192
Tree: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/tree/4fa70192
Diff: http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/diff/4fa70192
Branch: refs/heads/master
Commit: 4fa7019286a7f809a94c89c0cc71e767364ac1f3
Parents: 12e5abf
Author: Lee moon soo <[email protected]>
Authored: Fri Jul 3 13:46:35 2015 -0700
Committer: Lee moon soo <[email protected]>
Committed: Sun Jul 5 10:47:01 2015 -0700
----------------------------------------------------------------------
notebook/2A94M5J1Z/note.json | 101 +++++++++++++++++++-------------------
1 file changed, 51 insertions(+), 50 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-zeppelin/blob/4fa70192/notebook/2A94M5J1Z/note.json
----------------------------------------------------------------------
diff --git a/notebook/2A94M5J1Z/note.json b/notebook/2A94M5J1Z/note.json
index a37cf19..785ccea 100644
--- a/notebook/2A94M5J1Z/note.json
+++ b/notebook/2A94M5J1Z/note.json
@@ -13,7 +13,7 @@
"groups": [],
"scatter": {}
},
- "editorHide": false
+ "editorHide": true
},
"settings": {
"params": {},
@@ -33,41 +33,8 @@
"progressUpdateIntervalMs": 500
},
{
- "title": "Prepare data",
- "text": "import sys.process._\n//you will need \u0027wget\u0027 tool to
download\n\"wget
http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip\"
!\n\"mkdir data\" !\n\"unzip bank.zip -d data\" !\n\"rm bank.zip\" !",
- "config": {
- "colWidth": 12.0,
- "graph": {
- "mode": "table",
- "height": 300.0,
- "optionOpen": false,
- "keys": [],
- "values": [],
- "groups": [],
- "scatter": {}
- },
- "title": true
- },
- "settings": {
- "params": {},
- "forms": {}
- },
- "jobName": "paragraph_1417656535623_-196593192",
- "id": "20141204-102855_1590713432",
- "result": {
- "code": "SUCCESS",
- "type": "TEXT",
- "msg": "import sys.process._\nwarning: there were 1 feature
warning(s); re-run with -feature for details\nres1: Int \u003d 0\nwarning:
there were 1 feature warning(s); re-run with -feature for details\nres2: Int
\u003d 0\nwarning: there were 1 feature warning(s); re-run with -feature for
details\nres3: Int \u003d 0\nwarning: there were 1 feature warning(s); re-run
with -feature for details\nres4: Int \u003d 0\n"
- },
- "dateCreated": "Dec 4, 2014 10:28:55 AM",
- "dateStarted": "Apr 1, 2015 9:11:12 PM",
- "dateFinished": "Apr 1, 2015 9:11:22 PM",
- "status": "FINISHED",
- "progressUpdateIntervalMs": 500
- },
- {
"title": "Load data into table",
- "text": "import sys.process._\n// Zeppelin creates and injects sc
(SparkContext) and sqlContext (HiveContext or SqlContext)\n// So you don\u0027t
need create them manually\n\nval zeppelinHome \u003d (\"pwd\"
!!).replace(\"\\n\", \"\")\nval bankText \u003d
sc.textFile(s\"file://$zeppelinHome/data/bank-full.csv\")\n\ncase class
Bank(age: Integer, job: String, marital: String, education: String, balance:
Integer)\n\nval bank \u003d bankText.map(s \u003d\u003e
s.split(\";\")).filter(s \u003d\u003e s(0) !\u003d \"\\\"age\\\"\").map(\n s
\u003d\u003e Bank(s(0).toInt, \n s(1).replaceAll(\"\\\"\", \"\"),\n
s(2).replaceAll(\"\\\"\", \"\"),\n
s(3).replaceAll(\"\\\"\", \"\"),\n s(5).replaceAll(\"\\\"\",
\"\").toInt\n )\n).toDF()\nbank.registerTempTable(\"bank\")\n\n",
+ "text": "import org.apache.commons.io.IOUtils\nimport
java.net.URL\nimport java.nio.charset.Charset\n\n// Zeppelin creates and
injects sc (SparkContext) and sqlContext (HiveContext or SqlContext)\n// So you
don\u0027t need create them manually\n\n// load bank data\nval bankText \u003d
sc.parallelize(\n IOUtils.toString(\n new
URL(\"https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv\"),\n
Charset.forName(\"utf8\")).split(\"\\n\"))\n\ncase class Bank(age: Integer,
job: String, marital: String, education: String, balance: Integer)\n\nval bank
\u003d bankText.map(s \u003d\u003e s.split(\";\")).filter(s \u003d\u003e s(0)
!\u003d \"\\\"age\\\"\").map(\n s \u003d\u003e Bank(s(0).toInt, \n
s(1).replaceAll(\"\\\"\", \"\"),\n s(2).replaceAll(\"\\\"\",
\"\"),\n s(3).replaceAll(\"\\\"\", \"\"),\n
s(5).replaceAll(\"\\\"\", \"\").toInt\n
)\n).toDF()\nbank.registerTempTable(\"bank\")",
"config": {
"colWidth": 12.0,
"graph": {
@@ -90,11 +57,11 @@
"result": {
"code": "SUCCESS",
"type": "TEXT",
- "msg": "import sys.process._\nsqlContext:
org.apache.spark.sql.SQLContext \u003d
org.apache.spark.sql.SQLContext@2c91e2d6\nwarning: there were 1 feature
warning(s); re-run with -feature for details\nzeppelinHome: String \u003d
/home/langley/lab/incubator-zeppelin\nbankText:
org.apache.spark.rdd.RDD[String] \u003d
/home/langley/lab/incubator-zeppelin/data/bank-full.csv MapPartitionsRDD[1] at
textFile at \u003cconsole\u003e:31\ndefined class Bank\nbank:
org.apache.spark.sql.DataFrame \u003d [age: int, job: string, marital: string,
education: string, balance: int]\n"
+ "msg": "import org.apache.commons.io.IOUtils\nimport
java.net.URL\nimport java.nio.charset.Charset\nbankText:
org.apache.spark.rdd.RDD[String] \u003d ParallelCollectionRDD[32] at
parallelize at \u003cconsole\u003e:65\ndefined class Bank\nbank:
org.apache.spark.sql.DataFrame \u003d [age: int, job: string, marital: string,
education: string, balance: int]\n"
},
"dateCreated": "Feb 10, 2015 1:52:59 AM",
- "dateStarted": "Apr 1, 2015 9:11:28 PM",
- "dateFinished": "Apr 1, 2015 9:11:39 PM",
+ "dateStarted": "Jul 3, 2015 1:43:40 PM",
+ "dateFinished": "Jul 3, 2015 1:43:45 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
@@ -144,11 +111,11 @@
"result": {
"code": "SUCCESS",
"type": "TABLE",
- "msg":
"age\tvalue\n18\t12\n19\t35\n20\t50\n21\t79\n22\t129\n23\t202\n24\t302\n25\t527\n26\t805\n27\t909\n28\t1038\n29\t1185\n"
+ "msg":
"age\tvalue\n19\t4\n20\t3\n21\t7\n22\t9\n23\t20\n24\t24\n25\t44\n26\t77\n27\t94\n28\t103\n29\t97\n"
},
"dateCreated": "Feb 10, 2015 1:53:02 AM",
- "dateStarted": "Apr 1, 2015 9:11:43 PM",
- "dateFinished": "Apr 1, 2015 9:11:45 PM",
+ "dateStarted": "Jul 3, 2015 1:43:17 PM",
+ "dateFinished": "Jul 3, 2015 1:43:23 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
@@ -206,11 +173,11 @@
"result": {
"code": "SUCCESS",
"type": "TABLE",
- "msg":
"age\tvalue\n18\t12\n19\t35\n20\t50\n21\t79\n22\t129\n23\t202\n24\t302\n25\t527\n26\t805\n27\t909\n28\t1038\n29\t1185\n30\t1757\n31\t1996\n32\t2085\n33\t1972\n34\t1930\n"
+ "msg":
"age\tvalue\n19\t4\n20\t3\n21\t7\n22\t9\n23\t20\n24\t24\n25\t44\n26\t77\n27\t94\n28\t103\n29\t97\n30\t150\n31\t199\n32\t224\n33\t186\n34\t231\n"
},
"dateCreated": "Feb 12, 2015 2:54:04 PM",
- "dateStarted": "Apr 1, 2015 9:12:03 PM",
- "dateFinished": "Apr 1, 2015 9:12:03 PM",
+ "dateStarted": "Jul 3, 2015 1:43:28 PM",
+ "dateFinished": "Jul 3, 2015 1:43:29 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
@@ -279,11 +246,11 @@
"result": {
"code": "SUCCESS",
"type": "TABLE",
- "msg":
"age\tvalue\n18\t12\n19\t35\n20\t47\n21\t74\n22\t120\n23\t175\n24\t248\n25\t423\n26\t615\n27\t658\n28\t697\n29\t683\n30\t1012\n31\t1017\n32\t941\n33\t746\n34\t650\n35\t631\n36\t538\n37\t453\n38\t394\n39\t346\n40\t257\n41\t241\n42\t218\n43\t183\n44\t170\n45\t146\n46\t130\n47\t100\n48\t124\n49\t101\n50\t76\n51\t72\n52\t62\n53\t71\n54\t55\n55\t54\n56\t45\n57\t38\n58\t35\n59\t36\n60\t27\n61\t5\n63\t2\n66\t5\n67\t3\n68\t4\n69\t2\n70\t1\n71\t1\n72\t5\n73\t2\n77\t1\n83\t2\n86\t1\n"
+ "msg":
"age\tvalue\n19\t4\n20\t3\n21\t7\n22\t9\n23\t17\n24\t13\n25\t33\n26\t56\n27\t64\n28\t78\n29\t56\n30\t92\n31\t86\n32\t105\n33\t61\n34\t75\n35\t46\n36\t50\n37\t43\n38\t44\n39\t30\n40\t25\n41\t19\n42\t23\n43\t21\n44\t20\n45\t15\n46\t14\n47\t12\n48\t12\n49\t11\n50\t8\n51\t6\n52\t9\n53\t4\n55\t3\n56\t3\n57\t2\n58\t7\n59\t2\n60\t5\n66\t2\n69\t1\n"
},
"dateCreated": "Feb 13, 2015 11:04:22 PM",
- "dateStarted": "Apr 1, 2015 9:12:10 PM",
- "dateFinished": "Apr 1, 2015 9:12:10 PM",
+ "dateStarted": "Jul 3, 2015 1:43:33 PM",
+ "dateFinished": "Jul 3, 2015 1:43:34 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
@@ -299,7 +266,8 @@
"values": [],
"groups": [],
"scatter": {}
- }
+ },
+ "editorHide": true
},
"settings": {
"params": {},
@@ -319,22 +287,55 @@
"progressUpdateIntervalMs": 500
},
{
- "config": {},
+ "text": "%md\n\nAbout bank data\n\n```\nCitation Request:\n This
dataset is public available for research. The details are described in [Moro et
al., 2011]. \n Please include this citation if you plan to use this
database:\n\n [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using
Data Mining for Bank Direct Marketing: An Application of the CRISP-DM
Methodology. \n In P. Novais et al. (Eds.), Proceedings of the European
Simulation and Modelling Conference - ESM\u00272011, pp. 117-121, Guimarães,
Portugal, October, 2011. EUROSIS.\n\n Available at: [pdf]
http://hdl.handle.net/1822/14838\n [bib]
http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt\n```",
+ "config": {
+ "colWidth": 12.0,
+ "graph": {
+ "mode": "table",
+ "height": 300.0,
+ "optionOpen": false,
+ "keys": [],
+ "values": [],
+ "groups": [],
+ "scatter": {}
+ },
+ "editorHide": true
+ },
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1427420818407_872443482",
"id": "20150326-214658_12335843",
+ "result": {
+ "code": "SUCCESS",
+ "type": "HTML",
+ "msg": "\u003cp\u003eAbout bank
data\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003eCitation Request:\n This
dataset is public available for research. The details are described in [Moro et
al., 2011]. \n Please include this citation if you plan to use this
database:\n\n [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using
Data Mining for Bank Direct Marketing: An Application of the CRISP-DM
Methodology. \n In P. Novais et al. (Eds.), Proceedings of the European
Simulation and Modelling Conference - ESM\u00272011, pp. 117-121, Guimarães,
Portugal, October, 2011. EUROSIS.\n\n Available at: [pdf]
http://hdl.handle.net/1822/14838\n [bib]
http://www3.dsi.uminho.pt/pcortez/bib/2011-esm-1.txt\n\u003c/code\u003e\u003c/pre\u003e\n"
+ },
"dateCreated": "Mar 26, 2015 9:46:58 PM",
+ "dateStarted": "Jul 3, 2015 1:44:56 PM",
+ "dateFinished": "Jul 3, 2015 1:44:56 PM",
+ "status": "FINISHED",
+ "progressUpdateIntervalMs": 500
+ },
+ {
+ "config": {},
+ "settings": {
+ "params": {},
+ "forms": {}
+ },
+ "jobName": "paragraph_1435955447812_-158639899",
+ "id": "20150703-133047_853701097",
+ "dateCreated": "Jul 3, 2015 1:30:47 PM",
"status": "READY",
"progressUpdateIntervalMs": 500
}
],
"name": "Zeppelin Tutorial",
"id": "2A94M5J1Z",
+ "angularObjects": {},
"config": {
"looknfeel": "default"
},
"info": {}
-}
+}
\ No newline at end of file