[GitHub] zeppelin pull request #1339: [WIP][ZEPPELIN-1332] Remove spark-dependencies ...

AhyoungRyu Wed, 17 Aug 2016 23:04:49 -0700

GitHub user AhyoungRyu reopened a pull request:

    https://github.com/apache/zeppelin/pull/1339


    [WIP][ZEPPELIN-1332] Remove spark-dependencies & suggest new way

    ### What is this PR for?
    Currently, Zeppelin's embedded Spark is located under `interpreter/spark/`.
    For those who **build Zeppelin from source**, this Spark is downloaded when
    they build with the [build profiles](https://github.com/apache/zeppelin#spark-interpreter).
    I think these build profiles are useful for customizing the embedded Spark, but many
    Spark users run their own Spark rather than Zeppelin's embedded one. Nowadays
    only Spark and Zeppelin beginners use the embedded Spark, and for them there are too
    many build profiles (I think it's too complicated).
    In the case of the **Zeppelin binary package**, the embedded Spark is included by default under
    `interpreter/spark/`. That's why the Zeppelin package is so large.
    
    This PR changes how the embedded Spark binary is downloaded, as follows.
    
    1. If users haven't set their own `SPARK_HOME`, [bin/download-spark.sh](https://github.com/AhyoungRyu/zeppelin/blob/5703fbf27fedda9ec7dd142e275b8654c9bc6296/bin/download-spark.sh) runs when they start the Zeppelin server with `bin/zeppelin-daemon.sh` or `bin/zeppelin.sh`.
    2. [bin/download-spark.sh](https://github.com/AhyoungRyu/zeppelin/blob/5703fbf27fedda9ec7dd142e275b8654c9bc6296/bin/download-spark.sh) downloads `spark-2.0.0-bin-hadoop2.7.tgz` from a mirror site to `$ZEPPELIN_HOME/.spark-dist/` and untars it, sets `SPARK_HOME` to `$ZEPPELIN_HOME/.spark-dist/spark-2.0.0-bin-hadoop2.7`, and adds this `SPARK_HOME` to `conf/zeppelin-env.sh`.
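
The download flow described in the two steps above could be sketched roughly as follows. This is not the actual `bin/download-spark.sh` from the PR: the function names, the `mirror_url` parameter, and the choice to append the export line to `zeppelin-env.sh` (rather than edit a specific line) are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the download flow; not the script shipped in the PR.

SPARK_VERSION="2.0.0"
HADOOP_VERSION="2.7"
SPARK_ARCHIVE="spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}"

# Print the directory where the unpacked Spark distribution would live.
spark_dist_home() {
  local zeppelin_home="$1"
  echo "${zeppelin_home}/.spark-dist/${SPARK_ARCHIVE}"
}

# Succeed only when SPARK_HOME is unset/empty and nothing is unpacked yet.
needs_spark_download() {
  local zeppelin_home="$1"
  [ -z "${SPARK_HOME:-}" ] && [ ! -d "$(spark_dist_home "${zeppelin_home}")" ]
}

# Append an export line to the env file unless one is already present.
# (Assumption: appending is acceptable; the PR may edit a fixed line instead.)
persist_spark_home() {
  local env_file="$1" spark_home="$2"
  if ! grep -q '^export SPARK_HOME=' "${env_file}" 2>/dev/null; then
    echo "export SPARK_HOME=\"${spark_home}\"" >> "${env_file}"
  fi
}

# Download the archive from a caller-supplied mirror and unpack it.
# mirror_url is a placeholder; no real mirror is assumed here.
download_spark() {
  local zeppelin_home="$1" mirror_url="$2"
  local dist_dir="${zeppelin_home}/.spark-dist"
  mkdir -p "${dist_dir}"
  curl -fSL "${mirror_url}/${SPARK_ARCHIVE}.tgz" -o "${dist_dir}/${SPARK_ARCHIVE}.tgz" &&
    tar -xzf "${dist_dir}/${SPARK_ARCHIVE}.tgz" -C "${dist_dir}"
}

# Example wiring (commented out so that sourcing this file has no side effects):
# if needs_spark_download "${ZEPPELIN_HOME}"; then
#   download_spark "${ZEPPELIN_HOME}" "${MIRROR_URL}" &&
#     persist_spark_home "${ZEPPELIN_HOME}/conf/zeppelin-env.sh" \
#                        "$(spark_dist_home "${ZEPPELIN_HOME}")"
# fi
```

Keeping the "is a download needed" check separate from the download itself means a user-provided `SPARK_HOME` short-circuits everything before any network access happens.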
    
    With this new mechanism, we not only reduce the overall size of the Zeppelin binary
    package, but users also no longer need to type complicated build profiles
    when they build Zeppelin from source.
    
    ### What type of PR is it?
    Improvement
    
    ### Todos
    * [ ] - update 
[README.md](https://github.com/apache/zeppelin/blob/master/README.md)
    * [ ] - add `download-spark.cmd` for Windows users
    
    ### What is the Jira issue?
    See [ZEPPELIN-1332](https://issues.apache.org/jira/browse/ZEPPELIN-1332)'s 
description for the details about **Why we need to remove spark-dependencies** 
& **New suggestion for Zeppelin's embedded Spark binary**.
    
    
    ### How should this be tested?
    After applying this patch, build with `mvn clean package -DskipTests`. Please
    check that `spark-dependencies` has been removed properly.
     - Without prespecified `SPARK_HOME` 
      1. Start Zeppelin daemon
      <img width="975" alt="screen shot 2016-08-18 at 11 20 27 am" src="https://cloud.githubusercontent.com/assets/10060731/17759836/e3c16022-6535-11e6-8576-43975c3293c3.png">
      2. Check line 46 of `conf/zeppelin-env.sh`. `SPARK_HOME` will be set as shown below:
      ```
      export SPARK_HOME="/YOUR_ZEPPELIN_HOME/.spark-dist/spark-2.0.0-bin-hadoop2.7"
      ```
      3. Go to the Zeppelin web UI and run `sc.version` with the Spark interpreter and
      `echo $SPARK_HOME` with the sh interpreter.
      <img width="1030" alt="screen shot 2016-08-18 at 11 26 21 am" src="https://cloud.githubusercontent.com/assets/10060731/17759937/a7bcc584-6536-11e6-9664-cffdc6e5bdf8.png">
    
     - With prespecified `SPARK_HOME`
    Nothing changes. Zeppelin starts as before.
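
The manual check of `conf/zeppelin-env.sh` in the first case could also be scripted. The following is a minimal sketch under stated assumptions: the helper names `configured_spark_home` and `check_embedded_spark` are hypothetical and not part of the PR, and the check reads the last `export SPARK_HOME=` line wherever it appears rather than relying on a fixed line number.

```shell
#!/usr/bin/env bash
# Hypothetical verification helpers; not part of the PR.

# Print the SPARK_HOME value exported in an env file (the last export wins).
# Commented-out lines such as "# export SPARK_HOME=" are ignored.
configured_spark_home() {
  local env_file="$1"
  sed -n 's/^export SPARK_HOME="\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/\1/p' \
    "${env_file}" | tail -n 1
}

# Compare the configured value with the path the PR description says is used
# when SPARK_HOME was not set beforehand.
check_embedded_spark() {
  local zeppelin_home="$1"
  local expected="${zeppelin_home}/.spark-dist/spark-2.0.0-bin-hadoop2.7"
  local actual
  actual="$(configured_spark_home "${zeppelin_home}/conf/zeppelin-env.sh")"
  if [ "${actual}" = "${expected}" ]; then
    echo "OK: SPARK_HOME=${actual}"
  else
    echo "MISMATCH: expected ${expected}, found '${actual}'"
    return 1
  fi
}

# Example (commented out so that sourcing this file has no side effects):
# check_embedded_spark "${ZEPPELIN_HOME}"
```

In the second case, where `SPARK_HOME` was already set, a mismatch from `check_embedded_spark` is the expected result, since Zeppelin keeps using the user's own Spark.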
     
    ### Screenshots (if appropriate)
    
    ### Questions:
    * Do the license files need an update? no
    * Are there breaking changes for older versions? no
    * Does this need documentation? Yes, [README.md](https://github.com/apache/zeppelin/blob/master/README.md) needs to be updated


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AhyoungRyu/zeppelin ZEPPELIN-1332

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zeppelin/pull/1339.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1339
    
----
commit ae74e90f8409b7396eeebf34c103a6db071b1771
Author: AhyoungRyu <[email protected]>
Date:   2016-08-16T15:08:19Z

    Fix typo comment in interpreter.sh

commit ada6f37d1df60f37740d63c913cdd89f7b919269
Author: AhyoungRyu <[email protected]>
Date:   2016-08-17T01:52:06Z

    Remove spark-dependencies

commit 87b929d7d38e447306796cec44b35cb7317b9bb3
Author: AhyoungRyu <[email protected]>
Date:   2016-08-17T07:14:35Z

    Add spark-2.*-bin-hadoop* to .gitignore

commit 5703fbf27fedda9ec7dd142e275b8654c9bc6296
Author: AhyoungRyu <[email protected]>
Date:   2016-08-17T15:22:25Z

    Add download-spark.sh file

commit 35350bb9990436cd7ede1e611f0b94a56ed24793
Author: AhyoungRyu <[email protected]>
Date:   2016-08-17T15:28:51Z

    Remove useless comment line in common.sh

commit d6500a854c0a6a3616023c507fbdd061ae731288
Author: AhyoungRyu <[email protected]>
Date:   2016-08-18T03:32:11Z

    Remove zeppelin-spark-dependencies from r/pom.xml

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
