[
https://issues.apache.org/jira/browse/SPARK-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335518#comment-14335518
]
Corey J. Nolet commented on SPARK-5140:
---------------------------------------
I have a framework (similar to Cascading) where users build their RDD
transformations as reusable components, and the framework wires those
components together based on dependencies the users define. There is a notion
of sinks, or data outputs, which are also encapsulated in reusable components.
A sink depends on sources, and the sinks are what actually get executed. I
wanted to execute the sinks in parallel and have Spark be smart enough to
figure out what needs to block (wherever there's a cached RDD) to make that
possible. We're comparing the same data transformations against MapReduce jobs
written in JAQL, and the single-threaded execution of the sinks is causing
absolutely wretched run times.
I'm not saying an internal requirement of my framework should necessarily be
levied against Spark's feature set. However, this change (which seems like it
would only affect driver-side scheduling) would make the SparkContext truly
thread-safe. I've tried running jobs in parallel that over-utilize the
resources until one of them fully caches, and it just slows things down and
uses too many resources. I can't see why anyone submitting jobs from different
threads that depend on cached RDDs wouldn't want those threads to block for
the parents, but maybe I'm not thinking abstractly enough.
For now, I'm going to propagate down my data-source tree breadth-first and
no-op the sources that return cached RDDs, so that their children can be
scheduled in different threads.
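To make that workaround concrete, here is a minimal sketch of the traversal,
assuming a toy source-tree model (`Source`, `cached`, and `schedulableRoots`
are all illustrative names, not part of Spark or of the framework described
above):

```scala
import scala.collection.mutable

// Hypothetical model of a data-source tree: `cached` marks a source whose
// RDD is already materialized, so it does not need to be recomputed.
case class Source(name: String, cached: Boolean, children: Seq[Source] = Nil)

// Breadth-first walk that no-ops cached sources: each child of a cached
// source becomes the root of a subtree that can be submitted from its
// own thread, since nothing above it still needs to run.
def schedulableRoots(root: Source): Seq[Source] = {
  val out = mutable.Buffer.empty[Source]
  val queue = mutable.Queue(root)
  while (queue.nonEmpty) {
    val s = queue.dequeue()
    if (s.cached) s.children.foreach(queue.enqueue(_)) // no-op, descend
    else out += s                                      // run subtree as one job
  }
  out.toSeq
}

val tree = Source("a", cached = true, Seq(
  Source("b", cached = false),
  Source("c", cached = true, Seq(Source("d", cached = false)))))
val roots = schedulableRoots(tree).map(_.name) // Seq("b", "d")
```

Each returned root can then be submitted from its own thread, since every
cached ancestor above it is already materialized.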
> Two RDDs which are scheduled concurrently should be able to wait on parent in
> all cases
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-5140
> URL: https://issues.apache.org/jira/browse/SPARK-5140
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Corey J. Nolet
> Labels: features
>
> Not sure if this would change too much of the internals to be included in
> 1.2.1, but it would be very helpful if it could be.
> This ticket is from a discussion between myself and [~ilikerps]. Here's the
> result of some testing that [~ilikerps] did:
> bq. I did some testing as well, and it turns out the "wait for other guy to
> finish caching" logic is on a per-task basis, and it only works on tasks that
> happen to be executing on the same machine.
> bq. Once a partition is cached, we will schedule tasks that touch that
> partition on that executor. The problem here, though, is that the cache is in
> progress, and so the tasks are still scheduled randomly (or with whatever
> locality the data source has), so tasks which end up on different machines
> will not see that the cache is already in progress.
> bq. Here was my test, by the way:
> {code}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent._
> import scala.concurrent.duration._
> val rdd = sc.parallelize(0 until 8).map(i => { Thread.sleep(10000); i }).cache()
> val futures = (0 until 4).map { _ => Future { rdd.count } }
> Await.result(Future.sequence(futures), 120.seconds)
> {code}
> bq. Note that I run the future 4 times in parallel. I found that the first
> run has all tasks take 10 seconds. The second has about 50% of its tasks take
> 10 seconds, and the rest just wait for the first stage to finish. The last
> two runs have no tasks that take 10 seconds; all wait for the first two
> stages to finish.
> What we want is the ability to fire off a job and have the DAG figure out
> that two RDDs depend on the same parent so that when the children are
> scheduled concurrently, the first one to start will activate the parent and
> both will wait on the parent. When the parent is done, they will both be able
> to finish their work concurrently. We are trying to use this pattern by
> having the parent cache results.
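Outside Spark, the "first child activates the parent, the rest wait on it"
behavior the description asks for can be sketched with a `Promise` in plain
Scala (`SharedParent` and `materialize` are illustrative names for this
sketch, not Spark APIs):

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// A parent computation that runs at most once; every concurrent
// child blocks on the same future until the result is available.
class SharedParent[T](compute: () => T) {
  private val started = new AtomicBoolean(false)
  private val promise = Promise[T]()

  def materialize(): Future[T] = {
    // The first caller wins the race and triggers the computation;
    // every later caller just waits on the shared future.
    if (started.compareAndSet(false, true))
      promise.completeWith(Future(compute()))
    promise.future
  }
}

var runs = 0
val parent = new SharedParent(() => { runs += 1; Thread.sleep(100); 42 })

// Four "child jobs" fire concurrently; the parent computes only once.
val children = (0 until 4).map(_ => parent.materialize().map(_ + 1))
val results = Await.result(Future.sequence(children), 10.seconds)
```

This is the per-driver analogue of what the ticket asks the DAG scheduler to
do cluster-wide: deduplicate the parent's work and let all dependent jobs
block on its single materialization.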
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)