I'm happy to mentor the incubation if you are still looking for mentors.

I'd also like to second Matei's point that spark-kernel is a fairly
confusing name. Referring to these things as kernels only makes sense
from the IPython notebook's point of view. Outside of that context, it
sounds like the spark-core module, which this obviously isn't.



On Fri, Nov 13, 2015 at 2:28 PM, P. Taylor Goetz <ptgo...@gmail.com> wrote:

> Thanks for the reference, Alex. It answers my question regarding the path
> you chose.
>
> -Taylor
>
> > On Nov 13, 2015, at 12:13 AM, Alexander Bezzubov <abezzu...@nflabs.com> wrote:
> >
> > Hi,
> >
> > It looks pretty interesting, especially the part about integration with
> > Zeppelin as another Scala interpreter implementation.
> >
> > AFAIK there was a discussion about including Spark-Kernel in Spark core
> > (https://issues.apache.org/jira/browse/SPARK-4605), but I'm not sure
> > whether becoming a sub-project was considered.
> >
> > It would be interesting to know, as it does indeed look well aligned
> > with Apache Spark.
> >
> > --
> > Alex
> >
> >> On Fri, Nov 13, 2015 at 10:05 AM, P. Taylor Goetz <ptgo...@gmail.com> wrote:
> >>
> >> Just a quick (or maybe not :) ) question...
> >>
> >> Given the tight coupling to the Apache Spark project, were there any
> >> considerations or discussions with the Spark community regarding
> >> including the Spark-Kernel functionality outright in Spark, or the
> >> possibility of becoming a subproject?
> >>
> >> I'm just curious. I don't think an answer one way or another would
> >> necessarily block incubation.
> >>
> >> -Taylor
> >>
> >>> On Nov 12, 2015, at 7:17 PM, da...@fallside.com wrote:
> >>>
> >>> Hello, we would like to start a discussion on accepting the
> >>> Spark-Kernel, a mechanism for applications to interactively and
> >>> remotely access Apache Spark, into the Apache Incubator.
> >>>
> >>> The proposal is available online at
> >>> https://wiki.apache.org/incubator/SparkKernelProposal, and it is
> >>> appended to this email.
> >>>
> >>> We are looking for additional mentors to help with this project, and we
> >>> would much appreciate your guidance and advice.
> >>>
> >>> Thank you in advance,
> >>> David Fallside
> >>>
> >>>
> >>>
> >>> = Spark-Kernel Proposal =
> >>>
> >>> == Abstract ==
> >>> Spark-Kernel provides applications with a mechanism to interactively
> >>> and remotely access Apache Spark.
> >>>
> >>> == Proposal ==
> >>> The Spark-Kernel enables interactive applications to access Apache
> >>> Spark clusters. More specifically (a brief client sketch follows this
> >>> list):
> >>> * Applications can send code snippets and libraries for execution by
> >>> Spark
> >>> * Applications can be deployed separately from Spark clusters and
> >>> communicate with the Spark-Kernel using the provided Spark-Kernel
> >>> client
> >>> * Execution results and streaming data can be sent back to calling
> >>> applications
> >>> * Applications no longer have to be network-connected to the workers
> >>> on a Spark cluster because the Spark-Kernel acts as each application's
> >>> proxy
> >>> * Work has started on enabling Spark-Kernel to support languages in
> >>> addition to Scala, namely Python (with PySpark), R (with SparkR), and
> >>> SQL (with SparkSQL)
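> >>>
> >>> As a rough illustration, here is a minimal sketch of how an
> >>> application might drive the kernel through the Spark-Kernel client.
> >>> The names below (SparkKernelClient, execute, onResult, and the
> >>> endpoint URL) are illustrative assumptions, not the exact client API:
> >>>
> >>>   // Hypothetical sketch; class, method, and endpoint names are
> >>>   // assumptions for illustration, not the actual client API.
> >>>   val client = new SparkKernelClient("tcp://kernel-host:8888")
> >>>   client.execute("""
> >>>     val rdd = sc.parallelize(1 to 1000)
> >>>     rdd.sum()
> >>>   """).onResult(result => println("kernel returned: " + result))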
> >>>
> >>> == Background & Rationale ==
> >>> Apache Spark provides applications with a fast and general-purpose
> >>> distributed computing engine that supports static and streaming data,
> >>> tabular and graph representations of data, and an extensive set of
> >>> machine learning libraries. Consequently, a wide variety of
> >>> applications will be written for Spark: interactive applications that
> >>> require relatively frequent function evaluations, and batch-oriented
> >>> applications that require one-shot or only occasional evaluation.
> >>>
> >>> Apache Spark provides two mechanisms for applications to connect to
> >>> Spark. The primary mechanism launches applications on Spark clusters
> >>> using spark-submit
> >>> (http://spark.apache.org/docs/latest/submitting-applications.html);
> >>> this requires developers to bundle their application code plus any
> >>> dependencies into JAR files, and then submit them to Spark. A second
> >>> mechanism is an ODBC/JDBC API
> >>> (http://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine)
> >>> which enables applications to issue SQL queries against SparkSQL.
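> >>>
> >>> For concreteness, a typical spark-submit invocation and a minimal
> >>> JDBC query against the distributed SQL engine look roughly like the
> >>> following; the class name, master URL, JAR, host, and table are
> >>> placeholders, and the JDBC example assumes the HiveServer2 JDBC
> >>> driver is on the classpath:
> >>>
> >>>   // Mechanism 1: submit a pre-built application JAR, e.g.
> >>>   //   spark-submit --class com.example.MyApp \
> >>>   //     --master spark://host:7077 my-app.jar
> >>>
> >>>   // Mechanism 2: query the Spark SQL Thrift server over JDBC
> >>>   import java.sql.DriverManager
> >>>   val conn = DriverManager.getConnection("jdbc:hive2://host:10000")
> >>>   val rs = conn.createStatement().executeQuery(
> >>>     "SELECT count(*) FROM my_table")
> >>>   while (rs.next()) println(rs.getLong(1))
> >>>   conn.close()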
> >>>
> >>> Our experience developing interactive applications to run against
> >>> Spark, such as analytic applications and Jupyter Notebooks, was that
> >>> the spark-submit mechanism was overly cumbersome and slow (requiring
> >>> JAR creation and forking processes to run spark-submit), and the SQL
> >>> interface was too limiting and did not offer easy access to
> >>> components other than SparkSQL, such as streaming. The most promising
> >>> mechanism provided by Apache Spark was the command-line shell
> >>> (http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell)
> >>> which enabled us to execute code snippets and dynamically control the
> >>> tasks submitted to a Spark cluster. Spark does not provide the
> >>> command-line shell as a consumable service, but it gave us the
> >>> starting point from which we developed the Spark-Kernel.
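> >>>
> >>> As an example of the kind of snippet the shell supports interactively
> >>> but spark-submit makes cumbersome, consider a word count typed at the
> >>> spark-shell prompt (sc is the SparkContext the shell preconfigures;
> >>> the input path is a placeholder):
> >>>
> >>>   // Typed interactively; `sc` is provided by spark-shell.
> >>>   val counts = sc.textFile("hdfs:///data/sample.txt")
> >>>     .flatMap(_.split("\\s+"))
> >>>     .map(word => (word, 1))
> >>>     .reduceByKey(_ + _)
> >>>   // Print the ten most frequent words
> >>>   counts.sortBy(-_._2).take(10).foreach(println)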
> >>>
> >>> == Current Status ==
> >>> Spark-Kernel was first developed by a small team working on an
> >>> internal IBM Spark-related project in July 2014. In recognition of
> >>> its likely general utility to Spark users and developers, in November
> >>> 2014 the Spark-Kernel project was moved to GitHub and made available
> >>> under the Apache License V2.
> >>>
> >>> == Meritocracy ==
> >>> The current developers are familiar with the meritocratic open source
> >>> development process at Apache. As the project has gathered interest
> >>> on GitHub, the developers have started a process to invite additional
> >>> developers into the project, and at least one new developer is ready
> >>> to contribute code.
> >>>
> >>> == Community ==
> >>> We started building a community around the Spark-Kernel project when
> >>> we moved it to GitHub about one year ago. Since then the community
> >>> has grown to about 70 people, and there are regular requests and
> >>> suggestions from it. We believe that providing Apache Spark
> >>> application developers with a general-purpose and interactive API
> >>> holds a lot of community potential, especially considering possible
> >>> tie-ins with the Jupyter and data science communities.
> >>>
> >>> == Core Developers ==
> >>> The core developers of the project are currently all from IBM, from the
> >>> IBM Emerging Technology team and from IBM’s recently formed Spark
> >>> Technology Center.
> >>>
> >>> == Alignment ==
> >>> Apache, as the home of Apache Spark, is the most natural home for the
> >>> Spark-Kernel project because it was designed to work with Apache
> >>> Spark and to provide capabilities for interactive applications and
> >>> data science tools not provided by Spark itself.
> >>>
> >>> The Spark-Kernel also has an affinity with Jupyter (jupyter.org)
> >>> because it uses the Jupyter protocol for communications, and so
> >>> Jupyter Notebooks can directly use the Spark-Kernel as a kernel for
> >>> communicating with Apache Spark. However, we believe that the
> >>> Spark-Kernel provides a general-purpose mechanism enabling a wider
> >>> variety of applications than just Notebooks to access Spark, and so
> >>> the Spark-Kernel's greatest affinity is with Apache and Apache Spark.
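> >>>
> >>> Because the kernel speaks the Jupyter protocol, registering it with
> >>> Jupyter amounts to a small kernelspec file (kernel.json). A sketch
> >>> follows; the install path and flags are placeholders for a particular
> >>> installation, and {connection_file} is substituted by Jupyter itself:
> >>>
> >>>   {
> >>>     "display_name": "Spark-Kernel (Scala)",
> >>>     "language": "scala",
> >>>     "argv": [
> >>>       "/opt/spark-kernel/bin/spark-kernel",
> >>>       "--profile",
> >>>       "{connection_file}"
> >>>     ]
> >>>   }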
> >>>
> >>> == Known Risks ==
> >>> === Orphaned products ===
> >>> We believe the Spark-Kernel project has a low risk of abandonment due
> >>> to interest in its continued existence from several parties. More
> >>> specifically, the Spark-Kernel provides a capability that is not
> >>> provided by Apache Spark today, and it enables a wider range of
> >>> applications to leverage Spark. For example, IBM uses, or is
> >>> considering using, the Spark-Kernel in several offerings, including
> >>> its IBM Analytics for Apache Spark product in the Bluemix Cloud. A
> >>> couple of other commercial users are also using it, or considering
> >>> its use, in their offerings. Furthermore, Jupyter Notebooks are used
> >>> by data scientists, and Spark is gaining popularity as an analytic
> >>> engine for them; Jupyter Notebooks are very easily enabled with the
> >>> Spark-Kernel, which provides another constituency for it.
> >>>
> >>> === Inexperience with Open Source ===
> >>> The Spark-Kernel project has been running as an open-source project
> >>> (albeit with only IBM committers) for the past several months. The
> >>> project has an active issue tracker, and in response to the interest
> >>> indicated by the nature and volume of requests and comments, the team
> >>> has publicly stated it is beginning to build a process for accepting
> >>> third-party contributions to the project.
> >>>
> >>> === Relationships with Other Apache Products ===
> >>> The Spark-Kernel has a clear affinity with the Apache Spark project
> >>> because it is designed to provide capabilities for interactive
> >>> applications and data science tools not provided by Spark itself. The
> >>> Spark-Kernel can also be a back-end for the Zeppelin project
> >>> currently incubating at Apache; there is interest from the
> >>> Spark-Kernel community in developing this capability, and an
> >>> experimental branch has been started.
> >>>
> >>> === Homogeneous Developers ===
> >>> The current group of developers working on Spark-Kernel are all from
> >>> IBM, although the group is in the process of expanding its membership
> >>> to include non-IBM members who have been active in the project's
> >>> GitHub community.
> >>>
> >>> === Reliance on Salaried Developers ===
> >>> The initial committers are full-time employees at IBM, although not
> >>> all work on the project full-time.
> >>>
> >>> === Excessive Fascination with the Apache Brand ===
> >>> We believe the Spark-Kernel benefits Apache Spark application
> >>> developers, and we are interested in an Apache Spark-Kernel project
> >>> to benefit these developers by engaging a larger community,
> >>> facilitating closer ties with the existing Spark project, and yes,
> >>> gaining more visibility for the Spark-Kernel as a solution.
> >>>
> >>> We have recently become aware that the project name "Spark-Kernel"
> >>> may be interpreted as implying an association with an Apache project.
> >>> If the project is accepted by Apache, we suggest the project name
> >>> remain the same; otherwise we will change it to one that does not
> >>> imply any Apache association.
> >>>
> >>> === Documentation ===
> >>> Comprehensive documentation, including a "Getting Started" guide, API
> >>> specifications, and a roadmap, is available from the GitHub project;
> >>> see https://github.com/ibm-et/spark-kernel/wiki.
> >>>
> >>> === Initial Source ===
> >>> The source code resides at https://github.com/ibm-et/spark-kernel.
> >>>
> >>> === External Dependencies ===
> >>> The Spark-Kernel depends upon a number of Apache projects:
> >>> * Spark
> >>> * Hadoop
> >>> * Ivy
> >>> * Commons
> >>>
> >>> The Spark-Kernel also depends upon a number of other open source
> >>> projects:
> >>> * JeroMQ (LGPL with Static Linking Exception,
> >>> http://zeromq.org/area:licensing)
> >>> * Akka (MIT)
> >>> * JOpt Simple (MIT)
> >>> * Spring Framework Core (Apache v2)
> >>> * Play (Apache v2)
> >>> * SLF4J (MIT)
> >>> * Scala
> >>> * Scalatest (Apache v2)
> >>> * Scalactic (Apache v2)
> >>> * Mockito (MIT)
> >>>
> >>> == Required Resources ==
> >>> Developer and user mailing lists
> >>> * priv...@spark-kernel.incubator.apache.org (with moderated
> >>> subscriptions)
> >>> * comm...@spark-kernel.incubator.apache.org
> >>> * d...@spark-kernel.incubator.apache.org
> >>> * us...@spark-kernel.incubator.apache.org
> >>>
> >>> A git repository:
> >>> https://git-wip-us.apache.org/repos/asf/incubator-spark-kernel.git
> >>>
> >>> A JIRA issue tracker:
> >>> https://issues.apache.org/jira/browse/SPARK-KERNEL
> >>>
> >>> == Initial Committers ==
> >>> The initial list of committers is:
> >>> * Leugim Bustelo (g...@bustelos.com)
> >>> * Jakob Odersky (joder...@gmail.com)
> >>> * Luciano Resende (lrese...@apache.org)
> >>> * Robert Senkbeil (chip.senkb...@gmail.com)
> >>> * Corey Stubbs (cas5...@gmail.com)
> >>> * Miao Wang (wm...@hotmail.com)
> >>> * Sean Welleck (welle...@gmail.com)
> >>>
> >>> === Affiliations ===
> >>> All of the initial committers are employed by IBM.
> >>>
> >>> == Sponsors ==
> >>> === Champion ===
> >>> * Sam Ruby (IBM)
> >>>
> >>> === Nominated Mentors ===
> >>> * Luciano Resende
> >>>
> >>> We wish to recruit additional mentors during incubation.
> >>>
> >>> === Sponsoring Entity ===
> >>> The Apache Incubator.
> >>>
> >>>
> >>>
> >
> >
> > --
> > Kind regards,
> > Alexander.
>