Re: [DISCUSS] Spark-Kernel Incubator Proposal

P. Taylor Goetz Fri, 13 Nov 2015 14:29:21 -0800

Thanks for the reference, Alex. It answers my question regarding the path you 
chose.


-Taylor

> On Nov 13, 2015, at 12:13 AM, Alexander Bezzubov <abezzu...@nflabs.com> wrote:
> 
> Hi,
> 
> it looks pretty interesting, especially the part about integration with
> Zeppelin as another Scala interpreter implementation.
> 
> AFAIK there was a discussion on including Spark-Kernel in Spark core
> https://issues.apache.org/jira/browse/SPARK-4605 but I am not sure about the
> possibility of it becoming a sub-project.
> 
> Would be interesting to know as indeed it looks very aligned with Apache
> Spark.
> 
> --
> Alex
> 
>> On Fri, Nov 13, 2015 at 10:05 AM, P. Taylor Goetz <ptgo...@gmail.com> wrote:
>> 
>> Just a quick (or maybe not :) ) question...
>> 
>> Given the tight coupling to the Apache Spark project, were there any
>> considerations or discussions with the Spark community regarding including
>> the Spark-Kernel functionality outright in Spark, or the possibility of
>> becoming a subproject?
>> 
>> I'm just curious. I don't think an answer one way or another would
>> necessarily block incubation.
>> 
>> -Taylor
>> 
>>> On Nov 12, 2015, at 7:17 PM, da...@fallside.com wrote:
>>> 
>>> Hello, we would like to start a discussion on accepting the Spark-Kernel,
>>> a mechanism for applications to interactively and remotely access Apache
>>> Spark, into the Apache Incubator.
>>> 
>>> The proposal is available online at
>>> https://wiki.apache.org/incubator/SparkKernelProposal, and it is appended
>>> to this email.
>>> 
>>> We are looking for additional mentors to help with this project, and we
>>> would much appreciate your guidance and advice.
>>> 
>>> Thank you in advance,
>>> David Fallside
>>> 
>>> 
>>> 
>>> = Spark-Kernel Proposal =
>>> 
>>> == Abstract ==
>>> Spark-Kernel provides applications with a mechanism to interactively and
>>> remotely access Apache Spark.
>>> 
>>> == Proposal ==
>>> The Spark-Kernel enables interactive applications to access Apache Spark
>>> clusters. More specifically:
>>> * Applications can send code snippets and libraries for execution by
>>> Spark
>>> * Applications can be deployed separately from Spark clusters and
>>> communicate with the Spark-Kernel using the provided Spark-Kernel client
>>> * Execution results and streaming data can be sent back to calling
>>> applications
>>> * Applications no longer have to be network connected to the workers on a
>>> Spark cluster because the Spark-Kernel acts as each application’s proxy
>>> * Work has started on enabling Spark-Kernel to support languages in
>>> addition to Scala, namely Python (with PySpark), R (with SparkR), and SQL
>>> (with SparkSQL)
>>> 
>>> == Background & Rationale ==
>>> Apache Spark provides applications with a fast and general purpose
>>> distributed computing engine that supports static and streaming data,
>>> tabular and graph representations of data, and an extensive collection of
>>> machine learning libraries. Consequently, a wide variety of applications
>>> will be written for Spark and there will be interactive applications that
>>> require relatively frequent function evaluations, and batch-oriented
>>> applications that require one-shot or only occasional evaluation.
>>> 
>>> Apache Spark provides two mechanisms for applications to connect with
>>> Spark. The primary mechanism launches applications on Spark clusters using
>>> spark-submit
>>> (http://spark.apache.org/docs/latest/submitting-applications.html); this
>>> requires developers to bundle their application code plus any dependencies
>>> into JAR files, and then submit them to Spark. A second mechanism is an
>>> ODBC/JDBC API
>>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine)
>>> which enables applications to issue SQL queries against SparkSQL.
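
As a rough illustration of the spark-submit path described above, the following Python sketch assembles the kind of command an application would have to fork for each job. The JAR path, main class, master URL, and application arguments are hypothetical placeholders, not taken from the proposal; only the `--class` and `--master` options and the trailing JAR-plus-arguments layout follow the documented spark-submit usage. The command is built but deliberately not executed.

```python
import shlex


def build_spark_submit_command(jar_path, main_class, master_url, app_args=()):
    """Return the argv list for launching a pre-built application JAR.

    All values passed in are illustrative; the caller must already have
    bundled the application code and its dependencies into jar_path.
    """
    command = [
        "spark-submit",
        "--class", main_class,   # entry point inside the JAR
        "--master", master_url,  # cluster the job is submitted to
        jar_path,                # the bundled application
    ]
    command.extend(app_args)     # arguments forwarded to the application
    return command


# Hypothetical application and cluster, for illustration only.
command = build_spark_submit_command(
    jar_path="target/my-analytics-app.jar",
    main_class="com.example.MyAnalyticsApp",
    master_url="spark://master.example.com:7077",
    app_args=["--input", "hdfs:///data/events"],
)

# Print a shell-safe rendering of the command instead of running it.
print(" ".join(shlex.quote(part) for part in command))
```

Each interactive request handled this way would need its own JAR build and its own forked process, which is the overhead the proposal goes on to describe.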
>>> 
>>> Our experience when developing interactive applications, such as analytic
>>> applications and Jupyter Notebooks, to run against Spark was that the
>>> spark-submit mechanism was overly cumbersome and slow (requiring JAR
>>> creation and forking processes to run spark-submit), and the SQL interface
>>> was too limiting and did not offer easy access to components other than
>>> SparkSQL, such as streaming. The most promising mechanism provided by
>>> Apache Spark was the command-line shell
>>> (http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell)
>>> which enabled us to execute code snippets and dynamically control the
>>> tasks submitted to a Spark cluster. Spark does not provide the
>>> command-line shell as a consumable service but it provided us with the
>>> starting point from which we developed the Spark-Kernel.
>>> 
>>> == Current Status ==
>>> Spark-Kernel was first developed by a small team working on an
>>> internal-IBM Spark-related project in July 2014. In recognition of its
>>> likely general utility to Spark users and developers, in November 2014 the
>>> Spark-Kernel project was moved to GitHub and made available under the
>>> Apache License V2.
>>> 
>>> == Meritocracy ==
>>> The current developers are familiar with the meritocratic open source
>>> development process at Apache. As the project has gathered interest at
>>> GitHub the developers have actively started a process to invite additional
>>> developers into the project, and we have at least one new developer who is
>>> ready to contribute code to the project.
>>> 
>>> == Community ==
>>> We started building a community around the Spark-Kernel project when we
>>> moved it to GitHub about one year ago. Since then we have grown to about
>>> 70 people, and there are regular requests and suggestions from the
>>> community. We believe that providing Apache Spark application developers
>>> with a general-purpose and interactive API holds a lot of community
>>> potential, especially considering possible tie-ins with the Jupyter and
>>> data science community.
>>> 
>>> == Core Developers ==
>>> The core developers of the project are currently all from IBM, from the
>>> IBM Emerging Technology team and from IBM’s recently formed Spark
>>> Technology Center.
>>> 
>>> == Alignment ==
>>> Apache, as the home of Apache Spark, is the most natural home for the
>>> Spark-Kernel project because it was designed to work with Apache Spark and
>>> to provide capabilities for interactive applications and data science
>>> tools not provided by Spark itself.
>>> 
>>> The Spark-Kernel also has an affinity with Jupyter (jupyter.org) because
>>> it uses the Jupyter protocol for communications, and so Jupyter Notebooks
>>> can directly use the Spark-Kernel as a kernel for communicating with
>>> Apache Spark. However, we believe that the Spark-Kernel provides a
>>> general-purpose mechanism enabling a wider variety of applications than
>>> just Notebooks to access Spark, and so the Spark-Kernel’s greatest
>>> affinity is with Apache and Apache Spark.
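
To make the Jupyter connection above more concrete, here is a minimal Python sketch of the `execute_request` message that a Jupyter-protocol client sends to a kernel to run a code snippet. The top-level sections and field names follow the Jupyter messaging specification, but only a subset of the content fields is shown. The protocol version string, the username, and the Scala snippet are assumptions chosen for illustration; the thread does not say which protocol version the Spark-Kernel implements. The ZeroMQ framing and message signing used on the wire are omitted.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="example-app"):
    """Build a Jupyter 'execute_request' message as a plain dict."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,      # unique per message
            "session": session_id,           # shared by one client session
            "username": username,            # assumed value
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.0",                # assumed protocol version
        },
        "parent_header": {},                 # empty: this message starts a request
        "metadata": {},
        "content": {
            "code": code,                    # the snippet the kernel should run
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
        },
    }


# Hypothetical Scala snippet; `sc` is assumed to be a SparkContext
# made available by the kernel.
request = make_execute_request(
    "sc.parallelize(1 to 10).sum()",
    session_id=uuid.uuid4().hex,
)
print(json.dumps(request["content"], indent=2))
```

Because the message is plain structured data, any application that can produce it and speak the transport can drive the kernel, not only a notebook front end.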
>>> 
>>> == Known Risks ==
>>> === Orphaned products ===
>>> We believe the Spark-Kernel project has a low risk of abandonment due to
>>> interest in its continuing existence from several parties. More
>>> specifically, the Spark-Kernel provides a capability that is not provided
>>> by Apache Spark today, and it enables a wider range of applications to
>>> leverage Spark. For example, IBM uses (and is considering) the
>>> Spark-Kernel in several offerings including its IBM Analytics for Apache
>>> Spark product in the Bluemix Cloud. There are also a couple of other
>>> commercial users who are using or considering its use in their offerings.
>>> Furthermore, Jupyter Notebooks are used by data scientists and Spark is
>>> gaining popularity as an analytic engine for them. Jupyter Notebooks are
>>> very easily enabled with the Spark-Kernel and so there is another
>>> constituency for it.
>>> 
>>> === Inexperience with Open Source ===
>>> The Spark-Kernel project has been running as an open-source project
>>> (albeit with only IBM committers) for the past several months. The project
>>> has an active issue tracker and due to the interest indicated by the
>>> nature and volume of requests and comments, the team has publicly stated
>>> it is beginning to build a process so they can accept third-party
>>> contributions to the project.
>>> 
>>> === Relationships with Other Apache Products ===
>>> The Spark-Kernel has a clear affinity with the Apache Spark project
>>> because it is designed to provide capabilities for interactive
>>> applications and data science tools not provided by Spark itself. The
>>> Spark-Kernel can be a back-end for the Zeppelin project currently
>>> incubating at Apache. There is interest from the Spark-Kernel community to
>>> develop this capability and an experimental branch has been started.
>>> 
>>> === Homogeneous Developers ===
>>> The current group of developers working on Spark-Kernel are all from IBM
>>> although the group is in the process of expanding its membership to
>>> include members of the GitHub community who are not from IBM and who have
>>> been active in the Spark-Kernel community on GitHub.
>>> 
>>> === Reliance on Salaried Developers ===
>>> The initial committers are full-time employees at IBM although not all
>>> work on the project full-time.
>>> 
>>> === Excessive Fascination with the Apache Brand ===
>>> We believe the Spark-Kernel benefits Apache Spark application developers,
>>> and we are interested in an Apache Spark-Kernel project to benefit these
>>> developers by engaging a larger community, facilitating closer ties with
>>> the existing Spark project, and yes, gaining more visibility for the
>>> Spark-Kernel as a solution.
>>> 
>>> We have recently become aware that the project name “Spark-Kernel” may be
>>> interpreted as having an association with an Apache project. If the
>>> project is accepted by Apache, we suggest the project name remains the
>>> same, but otherwise we will change it to one that does not imply any
>>> Apache association.
>>> 
>>> === Documentation ===
>>> Comprehensive documentation including “Getting Started”, API
>>> specifications and a Roadmap are available from the GitHub project, see
>>> https://github.com/ibm-et/spark-kernel/wiki.
>>> 
>>> === Initial Source ===
>>> The source code resides at https://github.com/ibm-et/spark-kernel.
>>> 
>>> === External Dependencies ===
>>> The Spark-Kernel depends upon a number of Apache projects:
>>> * Spark
>>> * Hadoop
>>> * Ivy
>>> * Commons
>>> 
>>> The Spark-Kernel also depends upon a number of other open source projects:
>>> * JeroMQ (LGPL with Static Linking Exception,
>>> http://zeromq.org/area:licensing)
>>> * Akka (MIT)
>>> * JOpt Simple (MIT)
>>> * Spring Framework Core (Apache v2)
>>> * Play (Apache v2)
>>> * SLF4J (MIT)
>>> * Scala
>>> * Scalatest (Apache v2)
>>> * Scalactic (Apache v2)
>>> * Mockito (MIT)
>>> 
>>> == Required Resources ==
>>> Developer and user mailing lists
>>> * priv...@spark-kernel.incubator.apache.org (with moderated subscriptions)
>>> * comm...@spark-kernel.incubator.apache.org
>>> * d...@spark-kernel.incubator.apache.org
>>> * us...@spark-kernel.incubator.apache.org
>>> 
>>> A git repository:
>>> https://git-wip-us.apache.org/repos/asf/incubator-spark-kernel.git
>>> 
>>> A JIRA issue tracker: https://issues.apache.org/jira/browse/SPARK-KERNEL
>>> 
>>> == Initial Committers ==
>>> The initial list of committers is:
>>> * Leugim Bustelo (g...@bustelos.com)
>>> * Jakob Odersky (joder...@gmail.com)
>>> * Luciano Resende (lrese...@apache.org)
>>> * Robert Senkbeil (chip.senkb...@gmail.com)
>>> * Corey Stubbs (cas5...@gmail.com)
>>> * Miao Wang (wm...@hotmail.com)
>>> * Sean Welleck (welle...@gmail.com)
>>> 
>>> === Affiliations ===
>>> All of the initial committers are employed by IBM.
>>> 
>>> == Sponsors ==
>>> === Champion ===
>>> * Sam Ruby (IBM)
>>> 
>>> === Nominated Mentors ===
>>> * Luciano Resende
>>> 
>>> We wish to recruit additional mentors during incubation.
>>> 
>>> === Sponsoring Entity ===
>>> The Apache Incubator.
>>> 
>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>>> For additional commands, e-mail: general-h...@incubator.apache.org
>> 
> 
> 
> -- 
> --
> Kind regards,
> Alexander.

