GitHub user AhyoungRyu opened a pull request:
https://github.com/apache/zeppelin/pull/1339
[WIP][ZEPPELIN-1332] Remove spark-dependencies & suggest new way
### What is this PR for?
Currently, Zeppelin's embedded Spark is located under `interpreter/spark/`.
For those who **build Zeppelin from source**, this Spark is downloaded when they
build the source with [build
profiles](https://github.com/apache/zeppelin#spark-interpreter). I think these
build profiles are useful for customizing the embedded Spark, but many
Spark users run their own Spark rather than Zeppelin's embedded one. Nowadays,
only Spark and Zeppelin beginners use the embedded Spark, and for them there are
too many build profiles (it's too complicated, I think). In the **Zeppelin binary
package**, the embedded Spark is included by default under `interpreter/spark/`,
which is why the Zeppelin package is so large.
This PR changes how the embedded Spark binary is downloaded, as follows.
1. If the user has not set their own `SPARK_HOME`,
[bin/download-spark.sh](https://github.com/AhyoungRyu/zeppelin/blob/5703fbf27fedda9ec7dd142e275b8654c9bc6296/bin/download-spark.sh)
runs when they start the Zeppelin server with `bin/zeppelin-daemon.sh` or
`bin/zeppelin.sh`.
2. [bin/download-spark.sh](https://github.com/AhyoungRyu/zeppelin/blob/5703fbf27fedda9ec7dd142e275b8654c9bc6296/bin/download-spark.sh):
downloads `spark-2.0.0-bin-hadoop2.7.tgz` from a mirror site to
`$ZEPPELIN_HOME/.spark-dist/` and untars it -> sets `SPARK_HOME` to
`$ZEPPELIN_HOME/.spark-dist/spark-2.0.0-bin-hadoop2.7` -> adds this `SPARK_HOME`
to `conf/zeppelin-env.sh`
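
The flow in the two steps above can be sketched roughly as below. This is only an illustration, not the actual `bin/download-spark.sh` from this PR: the function names (`embedded_spark_home`, `needs_embedded_spark`, `download_spark`, `persist_spark_home`, `ensure_spark`) and the `SPARK_MIRROR_URL` variable are hypothetical, and the default URL is an assumption since the description only says "mirror site".

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the startup flow; all names below are illustrative.

SPARK_ARCHIVE="spark-2.0.0-bin-hadoop2.7"   # archive named in the PR description

# Assumption: the PR only says "mirror site"; this default is a placeholder.
SPARK_MIRROR_URL="${SPARK_MIRROR_URL:-https://archive.apache.org/dist/spark/spark-2.0.0}"

# Print the directory the embedded Spark is unpacked into.
embedded_spark_home() {
  echo "${ZEPPELIN_HOME}/.spark-dist/${SPARK_ARCHIVE}"
}

# Succeed only when the user has not set SPARK_HOME themselves.
needs_embedded_spark() {
  [ -z "${SPARK_HOME:-}" ]
}

# Download and untar Spark unless the unpacked directory already exists.
download_spark() {
  local dist_dir="${ZEPPELIN_HOME}/.spark-dist"
  mkdir -p "${dist_dir}"
  if [ ! -d "$(embedded_spark_home)" ]; then
    curl -fSL -o "${dist_dir}/${SPARK_ARCHIVE}.tgz" \
      "${SPARK_MIRROR_URL}/${SPARK_ARCHIVE}.tgz" &&
      tar -xzf "${dist_dir}/${SPARK_ARCHIVE}.tgz" -C "${dist_dir}"
  fi
}

# Append the export line to conf/zeppelin-env.sh unless one is already there.
persist_spark_home() {
  local env_file="${ZEPPELIN_HOME}/conf/zeppelin-env.sh"
  if ! grep -q '^export SPARK_HOME=' "${env_file}" 2>/dev/null; then
    echo "export SPARK_HOME=\"$(embedded_spark_home)\"" >> "${env_file}"
  fi
}

# Entry point a start script could call before launching the server.
ensure_spark() {
  if needs_embedded_spark; then
    download_spark && persist_spark_home &&
      export SPARK_HOME="$(embedded_spark_home)"
  fi
}
```

In this sketch a start script would call `ensure_spark` once before launching the server. When `SPARK_HOME` is already set, the function does nothing, which matches the "with prespecified `SPARK_HOME`" case.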
With this new mechanism, we can reduce the overall size of the Zeppelin binary
package, and users no longer need to type complicated build profiles
when they build Zeppelin from source.
### What type of PR is it?
Improvement
### Todos
* [ ] - update
[README.md](https://github.com/apache/zeppelin/blob/master/README.md)
* [ ] - add `download-spark.cmd` for Windows users
### What is the Jira issue?
See [ZEPPELIN-1332](https://issues.apache.org/jira/browse/ZEPPELIN-1332)'s
description for the details about **Why we need to remove spark-dependencies**
& **New suggestion for Zeppelin's embedded Spark binary**.
### How should this be tested?
After applying this patch, build with `mvn clean package -DskipTests`. Please
check that `spark-dependencies` has been removed.
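
One possible way to script that check is sketched below. It assumes that "removed" means the `spark-dependencies` directory is gone from the source root and that the root `pom.xml` no longer lists it as a module. The function name `spark_dependencies_removed` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check; assumes the module lived at <source root>/spark-dependencies
# and was listed in the root pom.xml as <module>spark-dependencies</module>.
# Note: a missing pom.xml is treated the same as "module not listed".
spark_dependencies_removed() {
  local src_root="$1"
  [ ! -e "${src_root}/spark-dependencies" ] &&
    ! grep -q '<module>spark-dependencies</module>' "${src_root}/pom.xml" 2>/dev/null
}
```

For example, `spark_dependencies_removed . && echo "removed"` could be run from the Zeppelin source root after applying the patch.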
- Without prespecified `SPARK_HOME`
1. Start Zeppelin daemon
<img width="975" alt="screen shot 2016-08-18 at 11 20 27 am"
src="https://cloud.githubusercontent.com/assets/10060731/17759836/e3c16022-6535-11e6-8576-43975c3293c3.png">
2. Check `conf/zeppelin-env.sh` line 46. `SPARK_HOME` will be set as shown
below
```
export SPARK_HOME="/YOUR_ZEPPELIN_HOME/.spark-dist/spark-2.0.0-bin-hadoop2.7"
```
3. Open the Zeppelin web UI and run `sc.version` with the Spark interpreter and
`echo $SPARK_HOME` with the sh interpreter.
<img width="1030" alt="screen shot 2016-08-18 at 11 26 21 am"
src="https://cloud.githubusercontent.com/assets/10060731/17759937/a7bcc584-6536-11e6-9664-cffdc6e5bdf8.png">
- With prespecified `SPARK_HOME`
Nothing changes. Zeppelin starts as before.
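
The manual check of `conf/zeppelin-env.sh` in the first scenario could also be scripted. The sketch below assumes the export line has the form shown in step 2. The function names `configured_spark_home` and `uses_embedded_spark` are hypothetical and are not part of this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for verifying the result of the first test scenario.

# Print the value of the last `export SPARK_HOME=...` line in the given file.
# Commented-out lines are ignored because the pattern is anchored at line start.
configured_spark_home() {
  sed -n 's/^export SPARK_HOME="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "$1" | tail -n 1
}

# Succeed only if zeppelin-env.sh points SPARK_HOME at the embedded Spark.
uses_embedded_spark() {
  local zeppelin_home="$1"
  local expected="${zeppelin_home}/.spark-dist/spark-2.0.0-bin-hadoop2.7"
  [ "$(configured_spark_home "${zeppelin_home}/conf/zeppelin-env.sh")" = "${expected}" ]
}
```

Running `uses_embedded_spark "$ZEPPELIN_HOME"` after starting the daemon without a prespecified `SPARK_HOME` should succeed. With a prespecified `SPARK_HOME` it should fail, because nothing is downloaded in that case.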
### Screenshots (if appropriate)
### Questions:
* Do the license files need an update? no
* Are there breaking changes for older versions? no
* Does this need documentation? Yes,
[README.md](https://github.com/apache/zeppelin/blob/master/README.md) needs to be updated
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AhyoungRyu/zeppelin ZEPPELIN-1332
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/zeppelin/pull/1339.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1339
----
commit ae74e90f8409b7396eeebf34c103a6db071b1771
Author: AhyoungRyu <[email protected]>
Date: 2016-08-16T15:08:19Z
Fix typo comment in interpreter.sh
commit ada6f37d1df60f37740d63c913cdd89f7b919269
Author: AhyoungRyu <[email protected]>
Date: 2016-08-17T01:52:06Z
Remove spark-dependencies
commit 87b929d7d38e447306796cec44b35cb7317b9bb3
Author: AhyoungRyu <[email protected]>
Date: 2016-08-17T07:14:35Z
Add spark-2.*-bin-hadoop* to .gitignore
commit 5703fbf27fedda9ec7dd142e275b8654c9bc6296
Author: AhyoungRyu <[email protected]>
Date: 2016-08-17T15:22:25Z
Add download-spark.sh file
commit 35350bb9990436cd7ede1e611f0b94a56ed24793
Author: AhyoungRyu <[email protected]>
Date: 2016-08-17T15:28:51Z
Remove useless comment line in common.sh
----