[
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291797#comment-14291797
]
Robert Stupp commented on SPARK-2389:
-------------------------------------
bq. That aside, why doesn't it scale?
Simply because it's just a single Spark client. If that machine's at its limit
for whatever reason (VM memory, OS resources, CPU, network, ...), that's it.
Sure, you can run multiple drivers - but each has its own, private set of data.
IMO separate preloading is nice for some applications, but data is usually not
immutable. For example:
* Imagine an application that provides offers for flights worldwide. It's a
huge amount of data and a huge amount of processing. It cannot simply be
preloaded - ticket prices vary from minute to minute based on booking
status, etc.
* Overall data set is quite big
* Overall load is too big for a single driver to handle - imagine thousands of
offer requests per second
* Failure of a single driver is an absolute no-go
* All clients have to access the same set of data
* Preloading is only possible at initial deployment, not during runtime
So - a suitable approach would be to have:
* a Spark cluster holding all the RDDs and doing all offer and booking related
operations
* a set of Spark clients to "abstract" Spark from the rest of the application
* a huge number of non-uniform frontend clients (could be web app servers, rich
clients, SOAP / REST frontends)
* everything (except the data) stateless
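The shape of that approach can be modeled with a toy sketch (plain Python; SharedContext and SparkClient are made-up names standing in for the cluster-side cached data and the stateless client layer - no such shared-context API exists in Spark today):

```python
# Toy model of the proposed architecture. SharedContext and
# SparkClient are hypothetical illustrations, NOT a real Spark API.
# The point: many stateless clients serve requests against ONE
# shared data set, so losing a client loses no data.

class SharedContext:
    """Stands in for a Spark cluster holding cached RDDs."""
    def __init__(self):
        self._cache = {}            # shared, cluster-side state

    def put(self, name, rows):
        self._cache[name] = rows

    def lookup(self, name, pred):
        return [r for r in self._cache[name] if pred(r)]


class SparkClient:
    """Stateless frontend-facing client; holds no private data."""
    def __init__(self, ctx):
        self.ctx = ctx

    def offers(self, origin):
        return self.ctx.lookup("flights", lambda f: f["from"] == origin)


ctx = SharedContext()
ctx.put("flights", [
    {"from": "FRA", "to": "JFK", "price": 420},
    {"from": "FRA", "to": "SFO", "price": 510},
    {"from": "LHR", "to": "JFK", "price": 390},
])

# Any number of clients can answer requests; all see the same data.
clients = [SparkClient(ctx) for _ in range(3)]
assert clients[0].offers("FRA") == clients[2].offers("FRA")
```

The design point of the sketch: the clients carry no state of their own, so adding more of them scales request throughput, and a crashed client can simply be replaced.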
> globally shared SparkContext / shared Spark "application"
> ---------------------------------------------------------
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the
> duration of the whole application* and run tasks in multiple threads. This
> has the benefit of isolating applications from each other, on both the
> scheduling side (each driver schedules its own tasks) and executor side
> (tasks from different applications run in different JVMs). However, it also
> means that *data cannot be shared* across different Spark applications
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted to support any number of
> --driver-- client processes to share executors and to share (persistent /
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web
> app servers) that want to use Spark as a _big computing machine_. Most
> important is the fact that Spark is quite good at caching/persisting data in
> memory / on disk thus removing load from backend data stores.
> Means: it would be really great to let different --driver-- client JVMs
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however require some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if run frequently
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its
> materialized state.
> With such a feature the overall performance of today's web applications could
> then be increased by adding more web app servers, more Spark nodes, more
> NoSQL nodes, etc.
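
The TTL idea for materialized RDD state mentioned in the issue description could look roughly like this (a plain-Python sketch; TtlCache is a hypothetical illustration - Spark's RDD persistence has no built-in TTL):

```python
import time

# Sketch of TTL-based re-materialization: a cached value is
# recomputed once its age exceeds the TTL. This is a stand-alone
# illustration, not a real Spark feature.

class TtlCache:
    def __init__(self, ttl_seconds, compute):
        self.ttl = ttl_seconds
        self.compute = compute      # function that (re)materializes the data
        self._value = None
        self._stamp = None          # time of last materialization

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        if self._stamp is None or now - self._stamp > self.ttl:
            self._value = self.compute()   # expired (or first access): recompute
            self._stamp = now
        return self._value


# Count how often the "RDD" is materialized.
calls = []
cache = TtlCache(ttl_seconds=60,
                 compute=lambda: calls.append(1) or len(calls))

assert cache.get(now=0) == 1      # first access materializes
assert cache.get(now=30) == 1     # within TTL: served from cache
assert cache.get(now=120) == 2    # TTL expired: re-materialized
```

Such a TTL would bound the staleness of shared cached data (e.g. the minute-by-minute ticket prices above) without forcing every request back to the backend store.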
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]