[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273225#comment-14273225 ]

Patrick Wendell commented on SPARK-3561:
----------------------------------------

So if the question is: "Is Spark only an API, or is it an integrated API/execution 
engine?"... we've taken a fairly clear stance over the history of the project 
that it's an integrated engine. I.e., Spark is not something like Pig, where it's 
intended primarily as a user API and we expect there to be different physical 
execution engines plugged in underneath.

In the past we haven't found that this prevents Spark from working well in 
different environments, for instance on Mesos, on YARN, etc. For this we've 
integrated at different layers, such as the storage layer and the scheduling 
layer, where there were well-defined APIs and integration points in the 
broader ecosystem. Compared with the alternatives, Spark is far more flexible 
in terms of runtime environments. The RDD API is so generic that it's very easy 
to customize and integrate.

For this reason, my feeling about decoupling execution from the rest of Spark is 
that it would tie our hands architecturally and not add much benefit. I don't 
see a good reason to make this broader change in the strategy of the project.

If there are specific improvements you see for making Spark work well on YARN, 
then we can definitely look at them.

> Allow for pluggable execution contexts in Spark
> -----------------------------------------------
>
>                 Key: SPARK-3561
>                 URL: https://issues.apache.org/jira/browse/SPARK-3561
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Oleg Zhurakousky
>              Labels: features
>         Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource managers such as 
> Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large-scale, batch, 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal: 
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to the Hadoop execution environment - as a 
> non-public API (@Experimental) not exposed to end users of Spark. 
> The trait will define 6 operations: 
> * hadoopFile 
> * newAPIHadoopFile 
> * broadcast 
> * runJob 
> * persist
> * unpersist
> Each method maps directly to the corresponding method in the current version 
> of SparkContext. The JobExecutionContext implementation will be accessed by 
> SparkContext via a master URL of the form 
> "execution-context:foo.bar.MyJobExecutionContext", with the default 
> implementation containing the existing code from SparkContext, thus allowing 
> the current (corresponding) methods of SparkContext to delegate to that 
> implementation. 
> An integrator will then have the option to provide a custom execution 
> context, either by implementing the trait from scratch or by extending 
> DefaultExecutionContext. 
> Please see the attached design doc for more details. 
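
For reference, a minimal Scala sketch of what the proposed trait could look 
like, inferred only from the six operations listed in the description above. 
The trait name comes from the issue; the parameters and signatures here are 
assumptions on my part and may differ from the attached design doc:

import scala.reflect.ClassTag

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.InputFormat
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat}

import org.apache.spark.{SparkConf, SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch only: everything except the trait name and the six
// operation names is assumed, not taken from the attached design doc.
trait JobExecutionContext {

  // Mirrors SparkContext.hadoopFile (old mapred API).
  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  // Mirrors SparkContext.newAPIHadoopFile (new mapreduce API).
  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      sc: SparkContext,
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration): RDD[(K, V)]

  // Mirrors SparkContext.broadcast.
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  // Mirrors SparkContext.runJob; the plugged-in execution environment decides
  // how the job is physically run.
  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit

  // Mirror RDD.persist / RDD.unpersist so the execution environment can
  // manage caching itself.
  def persist[T](sc: SparkContext, rdd: RDD[T], storageLevel: StorageLevel): RDD[T]
  def unpersist[T](sc: SparkContext, rdd: RDD[T], blocking: Boolean): RDD[T]
}

Selection via the master URL, as described above, might then look like this 
(again, just an illustration):

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("execution-context:foo.bar.MyJobExecutionContext")
val sc = new SparkContext(conf)  // corresponding calls delegate to the plugged-in context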


