[ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291797#comment-14291797 ]
Robert Stupp commented on SPARK-2389:
-------------------------------------

bq. That aside, why doesn't it scale?

Simply because it's just a single Spark client. If that machine is at its limit for whatever reason (VM memory, OS resources, CPU, network, ...), that's it. Sure, you can run multiple drivers - but each has its own, private set of data.

IMO separate preloading is nice for some applications, but data is usually not immutable. For example:
* Imagine an application that provides offers for flights worldwide. It's a huge amount of data and a huge amount of processing. It cannot simply be preloaded - ticket prices vary from minute to minute based on booking status, etc.
* The overall data set is quite big.
* The overall load is too big for a single driver to handle - imagine thousands of offer requests per second.
* Failure of a single driver is an absolute no-go.
* All clients have to access the same set of data.
* Preloading is impossible during runtime (only at initial deployment).

So a suitable approach would be to have:
* a Spark cluster holding all the RDDs and performing all offer- and booking-related operations
* a set of Spark clients to "abstract" Spark from the rest of the application
* a huge number of non-uniform frontend clients (could be web app servers, rich clients, SOAP / REST frontends)
* everything (except the data) stateless

> globally shared SparkContext / shared Spark "application"
> ---------------------------------------------------------
>
>                 Key: SPARK-2389
>                 URL: https://issues.apache.org/jira/browse/SPARK-2389
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the duration of the whole application* and run tasks in multiple threads.
> This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs). However, it also means that *data cannot be shared* across different Spark applications (instances of SparkContext) without writing it to an external storage system.
>
> IMO this is a limitation that should be lifted to allow any number of --driver-- client processes to share executors and to share (persistent / cached) data.
>
> This is especially useful if you have a bunch of frontend servers (dumb web app servers) that want to use Spark as a _big computing machine_. Most important is the fact that Spark is quite good at caching/persisting data in memory / on disk, thus removing load from backend data stores.
>
> Meaning: it would be really great to let different --driver-- client JVMs operate on the same RDDs and benefit from Spark's caching/persistence.
>
> It would however require some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc.) on the fly
> * stop a shared context
>
> Even "conventional" batch MR applications would benefit if run frequently against the same data set.
>
> As an implicit requirement, RDD persistence could get a TTL for its materialized state.
>
> With such a feature the overall performance of today's web applications could then be increased by adding more web app servers, more Spark nodes, more NoSQL nodes, etc.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
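The "TTL for materialized state" requirement can be sketched independently of Spark. Below is a minimal sketch in plain Python (all names are hypothetical, not a Spark API): a cache whose materialized entries expire after a TTL, so stale data is re-materialized on the next access - the behavior the issue asks for in shared RDD persistence.

```python
import time

class TTLCache:
    """Cache of materialized results; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, materialized_at)

    def get_or_compute(self, key, compute):
        """Return the cached value, recomputing it if missing or expired."""
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]       # still fresh: serve the cached copy
        value = compute()         # expired or absent: re-materialize
        self._store[key] = (value, now)
        return value

# Example: repeated requests share one cache; the expensive computation
# runs once, then again only after the TTL has elapsed.
calls = []
cache = TTLCache(ttl_seconds=0.05)
price = lambda: calls.append(1) or 42  # stand-in for an RDD materialization
assert cache.get_or_compute("fares", price) == 42
assert cache.get_or_compute("fares", price) == 42  # served from cache
time.sleep(0.06)
assert cache.get_or_compute("fares", price) == 42  # TTL expired: recomputed
assert len(calls) == 2
```

In the architecture described above, such a TTL would sit on the shared, cached data set, so all stateless clients see reasonably fresh prices without any client having to coordinate invalidation.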