[
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327903#comment-14327903
]
Patrick Wendell commented on SPARK-2389:
----------------------------------------
I've seen some variants of this question over time. Usually the set of
requirements from the user is like this:
1. We are building a long-lived application that uses a single Spark Context
containing cached RDDs, and it dispatches requests from multiple users.
2. We want it to be fault tolerant, so relying on a single point of failure, like
a shared job server or a single dispatcher, isn't acceptable.
3. We don't want to pay the serialization cost of going to a filesystem like
Tachyon, due to latency.
Then the "request" then whether we can make the driver program fault tolerant
or somehow have the state of the active RDD's and execution in an ongoing Spark
context stored persistently. Unfortunately, the RDD meta data and execution
state in the driver is arbitrary state (a driver is just a Java process), and
it's not possible to take any user program in Spark and make this state
entirely recoverable on process failure. If we started to go down this path,
we'd need to do things like define a standard serialization format for the RDD
data, a global namespace for RDDs, persistence, etc. And then you're building
a filesystem.
The real solution here is that applications need to provide resiliency
themselves, by architecting so that they either keep state entirely in a
filesystem (and dispatch requests by reading from persistent storage), or use
caching in a way where the cache is soft state that can be recovered from
persistent storage after a failure, perhaps with some temporary performance
degradation. The Spark ecosystem already has H/A for some components, such as
Streaming, and we achieved that by exploiting specifics of the architecture of
a streaming program and allowing it to recover from checkpoints, etc.
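To make that concrete, here is a minimal sketch (not from the original discussion)
of the soft-state-cache pattern, assuming the working set already lives in
persistent storage; the path is a hypothetical placeholder. If the driver fails,
a replacement driver simply re-runs the same load-and-cache step and serves
requests again after a temporary slowdown.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SoftStateCacheExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("soft-state-cache"))

    // Source of truth stays in persistent storage (hypothetical path); the
    // cached RDD is only an accelerator, never the sole copy of any state.
    def loadWorkingSet() =
      sc.textFile("hdfs:///data/working_set")
        .persist(StorageLevel.MEMORY_AND_DISK)

    val workingSet = loadWorkingSet()

    // Requests are answered from the cache. Nothing lives only in this
    // driver that cannot be rebuilt by calling loadWorkingSet() again in a
    // fresh process after a failure.
    println(s"working set size: ${workingSet.count()}")

    sc.stop()
  }
}
{code}
For the Streaming case mentioned above, the analogous recovery hook is
StreamingContext.getOrCreate, which takes a checkpoint directory plus a function
that creates the context, and rebuilds the streaming computation from the
checkpoint after a driver restart.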
In the future there will be a few major changes in Spark that will make this
whole thing much easier. The first is that we'll likely write extremely fast
serializers for RDDs that have structure (SchemaRDD/DataFrame); along with
in-memory filesystems and formats that provide predicate pushdown and other
optimizations, this will likely close much of the latency gap between on-heap
RDDs and data in persistent storage. Second, we may add H/A in other specific
components of Spark, such as the JDBC server, where we can exploit specifics of
the user-facing interface to allow fast recoverability. Applications that write
against those APIs then do not need to reason about H/A at all.
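As a rough illustration of that first direction (using the later
DataFrame/SparkSession API, with hypothetical paths and column names), a
structured working set can be kept in a columnar format such as Parquet, where
filters can be pushed down into the scan so that reads from persistent storage
stay cheap:
{code}
import org.apache.spark.sql.SparkSession

object ColumnarWorkingSet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("columnar-working-set").getOrCreate()

    // Write the structured data once to a persistent, columnar store.
    val events = spark.read.json("hdfs:///raw/events")
    events.write.mode("overwrite").parquet("hdfs:///tables/events")

    // Any later reader (including a brand-new driver started after a failure)
    // queries the persistent copy. The filter below can be pushed down into
    // the Parquet scan, so only the relevant data is read.
    val recent = spark.read.parquet("hdfs:///tables/events")
      .filter("event_date >= '2015-02-01'")
      .count()

    println(s"recent events: $recent")
    spark.stop()
  }
}
{code}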
Hopefully that was a helpful perspective!
> globally shared SparkContext / shared Spark "application"
> ---------------------------------------------------------
>
> Key: SPARK-2389
> URL: https://issues.apache.org/jira/browse/SPARK-2389
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Robert Stupp
>
> The documentation (in Cluster Mode Overview) cites:
> bq. Each application gets its own executor processes, which *stay up for the
> duration of the whole application* and run tasks in multiple threads. This
> has the benefit of isolating applications from each other, on both the
> scheduling side (each driver schedules its own tasks) and executor side
> (tasks from different applications run in different JVMs). However, it also
> means that *data cannot be shared* across different Spark applications
> (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted so that any number of
> --driver-- client processes can share executors and share (persistent /
> cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web
> app servers) that want to use Spark as a _big computing machine_. Most
> important is the fact that Spark is quite good at caching/persisting data in
> memory / on disk, thus removing load from backend data stores.
> In other words: it would be really great to let different --driver-- client JVMs
> operate on the same RDDs and benefit from Spark's caching/persistence.
> It would however require some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on
> the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if ran fequently
> against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its
> materialized state.
> With such a feature, the overall performance of today's web applications could
> then be increased by adding more web app servers, more Spark nodes, more
> NoSQL nodes, etc.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)