[
https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742519#comment-16742519
]
Alessandro Bellina commented on SPARK-26257:
--------------------------------------------
[~tcondie] I think you should share the design doc here too. It's a good idea;
I'm just curious about the details of what's being proposed.
> SPIP: Interop Support for Spark Language Extensions
> ---------------------------------------------------
>
> Key: SPARK-26257
> URL: https://issues.apache.org/jira/browse/SPARK-26257
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, R, Spark Core
> Affects Versions: 2.4.0
> Reporter: Tyson Condie
> Priority: Minor
>
> h2. Background and Motivation:
> There is a desire for third party language extensions for Apache Spark. Some
> notable examples include:
> * C#/F# from project Mobius [https://github.com/Microsoft/Mobius]
> * Haskell from project sparkle [https://github.com/tweag/sparkle]
> * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl]
> Presently, Apache Spark supports Python and R via a tightly integrated
> interop layer. It would seem that much of that existing interop layer could
> be refactored into a clean surface for general (third party) language
> bindings, such as those mentioned above. More specifically, could we
> generalize the following modules:
> * Deploy runners (e.g., PythonRunner and RRunner)
> * DataFrame Executors
> * RDD operations?
> The last is questionable: integrating third party language extensions at
> the RDD level may be too heavyweight and unnecessary given the preference
> for the DataFrame abstraction.
> The main goals of this effort would be:
> * Provide a clean abstraction for third party language extensions, making
> each extension easier to maintain as Apache Spark evolves
> * Provide guidance to third party language authors on how a language
> extension should be implemented
> * Provide general reusable libraries that are not specific to any language
> extension
> * Open the door to developers that prefer alternative languages
> * Identify and clean up common code shared between Python and R interops
> h2. Target Personas:
> Data Scientists, Data Engineers, Library Developers
> h2. Goals:
> Data scientists and engineers will have the opportunity to work with Spark in
> languages other than those natively supported. Library developers will be
> able to create language extensions for Spark in a clean way. The interop
> layer should also provide guidance for developing language extensions.
> h2. Non-Goals:
> The proposal does not aim to create an actual language extension. Rather, it
> aims to provide a stable interop layer for third party language extensions to
> dock.
> h2. Proposed API Changes:
> Much of the work will involve generalizing existing interop APIs for PySpark
> and R, specifically for the DataFrame API. For instance, it would be good to
> have a general deploy.Runner (similar to PythonRunner) for language extension
> efforts. In Spark SQL, it would be good to have a general InteropUDF and
> evaluator (similar to BatchEvalPythonExec).
> Low-level RDD operations should not be needed in this initial offering;
> depending on the success of the interop layer and given sufficient demand,
> RDD interop could be added later. However, one open question is whether to
> support a subset of low-level functions that are core to ETL, e.g., transform.
> h2. Optional Design Sketch:
> The work would be broken down into two top-level phases:
> Phase 1: Introduce general interop API for deploying a driver/application,
> running an interop UDF along with any other low-level transformations that
> aid with ETL.
> Phase 2: Port existing Python and R language extensions to the new interop
> layer. This port should be confined to the Spark core side, and all
> protocols specific to Python and R should not change, e.g., Python should
> continue to use py4j as the protocol between the Python process and core
> Spark. The port itself should be limited to a handful of files, e.g., for
> Python: PythonRunner, BatchEvalPythonExec, +PythonUDFRunner+,
> PythonRDD (possibly), and will mostly involve refactoring common logic
> into abstract implementations and utilities.
> h2. Optional Rejected Designs:
> The clear alternative is the status quo: developers who want to provide a
> third-party language extension to Spark do so directly, often by extending
> existing Python classes and overriding the portions that are relevant to the
> new extension. Not only is this not sound code (e.g., a JuliaRDD is not a
> PythonRDD, which contains a lot of reusable code), but it runs the great risk
> of future revisions making the subclass implementation obsolete. It would be
> hard to imagine that any third-party language extension would be successful
> if there were not something in place to guarantee its long-term
> maintainability.
> Another alternative is that third-party languages should only interact with
> Spark via pure SQL, possibly via REST. However, this does not enable UDFs
> written in the third-party language, a key desideratum in this effort. Such
> UDFs most notably take the form of legacy code that would otherwise need to
> be ported to a supported language, e.g., Scala. That exercise is extremely
> cumbersome and not always feasible because the code may no longer be
> available, i.e., only the compiled library exists.
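
To make the "Proposed API Changes" and the first rejected design in the quoted description more concrete, here is a minimal sketch of the shape such an interop layer might take. It is written in Python purely for brevity; the proposal itself targets Spark core. Every name below (InteropRunner, PythonInteropRunner, JuliaInteropRunner, InteropUDF, eval_in_batches, INTEROP_GATEWAY_PORT) is hypothetical and does not exist in Spark. The point is only that shared logic sits in a language-neutral base, so a Julia binding never has to subclass a Python-specific class.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, Iterable, Iterator, List


class InteropRunner(ABC):
    """Hypothetical language-neutral counterpart to PythonRunner / RRunner."""

    @abstractmethod
    def executable(self) -> str:
        """Launcher for the guest language (the only mandatory override)."""

    def extra_env(self) -> Dict[str, str]:
        """Language-specific environment variables; empty by default."""
        return {}

    def build_command(self, app: str, args: List[str]) -> List[str]:
        # Shared logic is written once here instead of being copied
        # (or inherited from an unrelated language's runner).
        return [self.executable(), app, *args]

    def build_env(self, gateway_port: int) -> Dict[str, str]:
        # INTEROP_GATEWAY_PORT is an assumed variable name for illustration.
        env = {"INTEROP_GATEWAY_PORT": str(gateway_port)}
        env.update(self.extra_env())
        return env


class PythonInteropRunner(InteropRunner):
    def executable(self) -> str:
        return "python"

    def extra_env(self) -> Dict[str, str]:
        return {"PYTHONUNBUFFERED": "YES"}


class JuliaInteropRunner(InteropRunner):
    # A sibling of the Python runner, not a subclass of it.
    def executable(self) -> str:
        return "julia"


class InteropUDF:
    """Hypothetical UDF descriptor; a Python callable stands in for guest code."""

    def __init__(self, name: str, func: Callable[[Any], Any]) -> None:
        self.name = name
        self.func = func


def eval_in_batches(
    udf: InteropUDF, rows: Iterable[Any], batch_size: int = 100
) -> Iterator[Any]:
    """Evaluate the UDF batch by batch, preserving input order."""
    batch: List[Any] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            for item in batch:
                yield udf.func(item)
            batch = []
    # Flush the final, possibly partial, batch.
    for item in batch:
        yield udf.func(item)
```

In a real interop layer the batch would be serialized and sent to the guest-language process rather than evaluated in place; the in-process call here is a stand-in so the sketch stays self-contained.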
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)