[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291797#comment-14291797 ]

Robert Stupp commented on SPARK-2389:
-------------------------------------

bq. That aside, why doesn't it scale?

Simply because it's still just a single Spark client. If that machine is at 
its limit for whatever reason (VM memory, OS resources, CPU, network, ...), 
that's it.

Sure, you can run multiple drivers - but each has its own private set of data, 
as the sketch below illustrates.
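A quick Scala illustration of that duplication (paths and app names invented 
here): two driver programs caching the same input each end up with a private 
copy in their own executors.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Driver JVM #1: gets its own executors and caches its own copy.
object OffersDriver1 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("offers-driver-1"))
    val offers = sc.textFile("hdfs:///offers").cache() // copy #1
    println(offers.count())
  }
}

// Driver JVM #2: same input path, but the cached blocks live in a
// different set of executors. Nothing is shared between the two.
object OffersDriver2 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("offers-driver-2"))
    val offers = sc.textFile("hdfs:///offers").cache() // copy #2
    println(offers.count())
  }
}
{code}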

IMO separate preloading is nice for some applications. But data is usually not 
immutable. For example:
* Imagine an application that provides offers for flights worldwide. It's a 
huge amount of data and a huge amount of processing. It cannot simply be 
preloaded - ticket prices vary from minute to minute based on booking 
status and the like
* Overall data set is quite big
* Overall load is too big for a single driver to handle - imagine thousands of 
offer requests per second
* Failure of a single driver is an absolute no-go
* All clients have to access the same set of data
* Preloading is only possible at initial deployment, not during runtime

So - a suitable approach would be to have:
* a Spark cluster holding all the RDDs and doing all offer and booking related 
operations
* a set of Spark clients to "abstract" Spark from the rest of the application
* a huge number of non-uniform frontend clients (could be web app servers, rich 
clients, SOAP / REST frontends)
* everything (except the data) stateless (a minimal single-driver sketch 
follows)
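
For the "Spark client" layer, the closest you can get today is a single 
long-lived driver shared by many request threads. A minimal sketch, assuming 
a hypothetical offer service (names and paths invented; it relies only on 
SparkContext being thread-safe for job submission):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical "Spark client": one long-lived driver that many frontend
// threads call into concurrently.
object OfferService {
  private val sc = new SparkContext(
    new SparkConf().setAppName("offer-service"))

  // Loaded and cached once; every request hits the same persisted RDD.
  private val offers = sc.textFile("hdfs:///offers").cache()

  // One offer request = one Spark job on the shared data.
  def offersFor(destination: String): Long =
    offers.filter(_.contains(destination)).count()
}
{code}

This covers the "all clients access the same data" bullet, but not scaling or 
failover: everything still funnels through one driver JVM, and a second 
instance of the service would rebuild its own private cache (see the first 
sketch above).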

> globally shared SparkContext / shared Spark "application"
> ---------------------------------------------------------
>
>                 Key: SPARK-2389
>                 URL: https://issues.apache.org/jira/browse/SPARK-2389
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Robert Stupp
>
> The documentation (in the Cluster Mode Overview) states:
> bq. Each application gets its own executor processes, which *stay up for the 
> duration of the whole application* and run tasks in multiple threads. This 
> has the benefit of isolating applications from each other, on both the 
> scheduling side (each driver schedules its own tasks) and executor side 
> (tasks from different applications run in different JVMs). However, it also 
> means that *data cannot be shared* across different Spark applications 
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to allow any number of 
> --driver-- client processes to share executors and to share (persistent / 
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web 
> app servers) that want to use Spark as a _big computing machine_. Most 
> important is the fact that Spark is quite good at caching/persisting data in 
> memory / on disk, thus removing load from backend data stores.
> In short: it would be really great to let different --driver-- client JVMs 
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would, however, require some administration mechanisms (sketched below) to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of CPUs, etc.) on 
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently 
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its 
> materialized state.
> With such a feature the overall performance of today's web applications 
> could then be increased by adding more web app servers, more Spark nodes, 
> more NoSQL nodes, etc.
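
None of the administration mechanisms above exist in Spark today. Purely as a 
strawman, an invented interface for them might look like this (every name 
below is hypothetical):

{code:scala}
// Strawman only - no such API exists in Spark. Invented names for the
// three administration mechanisms the issue asks for.
trait SharedContextAdmin {
  /** Start a context that multiple client JVMs can attach to. */
  def startShared(name: String, conf: Map[String, String]): Unit

  /** Resize the shared executor pool (# of workers, # of CPUs) on the fly. */
  def reconfigure(name: String, workers: Int, cpusPerWorker: Int): Unit

  /** Tear the shared context down, dropping its cached RDDs. */
  def stopShared(name: String): Unit
}
{code}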


