[ https://issues.apache.org/jira/browse/SPARK-6535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380956#comment-14380956 ]

Eric Johnston commented on SPARK-6535:
--------------------------------------

Thanks for the comment, Sean.

Here's my best shot at an example:

val myRdd: RDD[String] = ... // an RDD of file names

myRdd.map(x => getFileSource(x)).map(x => DoSomethingCPUIntensive(x)).collect()

Now if I have 8 cores, I have 8 CPUs simultaneously grabbing the file sources 
and then processing them when they arrive (so yes, not all U => V work waits 
on all T => U work). The problem I see is that when each thread goes to grab a 
source file, it blocks until that file is returned before it can continue 
processing. As a result, most of the CPU time is spent idle, waiting for 
source files to load.

The solution I see is to have the IO operation return a Future, so that the 
CPU can spend most of its time on the CPU-intensive work. In other words, a 
single thread would fire off several file source requests and then begin 
processing the files it already has. By the time it has finished processing 
those files, more source files will have become available because their 
Futures will have completed. This way, the bulk of the CPU time goes to the 
processing itself rather than to waiting on files to become available.
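To make the idea concrete, here is a minimal sketch of the batching scheme, 
written outside Spark against plain Scala Futures. The mapIO signature, the 
fetch helper, and the windowing strategy are my own illustration, not an 
existing Spark API: at most batchSize IO Futures are in flight at once, and 
we only ever block on the oldest one while newer requests complete in the 
background.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import ExecutionContext.Implicits.global

// Hypothetical sketch: keep at most `batchSize` IO Futures in flight
// while the caller consumes the CPU-intensive results.
def mapIO[T, U, V](it: Iterator[T], batchSize: Int)
                  (f1: T => Future[U])(f2: U => V): Iterator[V] =
  new Iterator[V] {
    private val inFlight = mutable.Queue.empty[Future[U]]
    private def fill(): Unit =
      while (inFlight.size < batchSize && it.hasNext)
        inFlight.enqueue(f1(it.next()))   // fire off IO ahead of consumption
    fill()
    def hasNext: Boolean = inFlight.nonEmpty
    def next(): V = {
      // Block only on the oldest request; newer ones keep completing.
      val u = Await.result(inFlight.dequeue(), Duration.Inf)
      fill()  // refill the window so IO stays overlapped with f2
      f2(u)   // CPU-intensive step
    }
  }

// Stand-in for a real IO call such as getFileSource:
def fetch(name: String): Future[String] = Future(name + "-data")

val out = mapIO(Iterator("a", "b", "c"), batchSize = 2)(fetch)(_.toUpperCase)
// out.toList == List("A-DATA", "B-DATA", "C-DATA")
```

Inside Spark this could be applied per partition, e.g. 
myRdd.mapPartitions(part => mapIO(part, 8)(getFileSource)(DoSomethingCPUIntensive)), 
so each task keeps its own small window of outstanding requests.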

> new RDD function that returns intermediate Future
> -------------------------------------------------
>
>                 Key: SPARK-6535
>                 URL: https://issues.apache.org/jira/browse/SPARK-6535
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core
>            Reporter: Eric Johnston
>            Priority: Minor
>              Labels: features, newbie
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I'm suggesting a possible Spark RDD method that I think could give value to a 
> number of people. I'd be interested in thoughts and feedback. Is this a good 
> or bad idea in general? Will it work well, but is too specific for Spark-Core?
> def mapIO[U, V : ClassTag](f1 : T => Future[U], f2 : U => V, batchSize : 
> Int) : RDD[V]
> The idea is that often we have an RDD[T] containing metadata, for 
> example a file path or a unique identifier to data in an external database. 
> We would like to retrieve this data, process it, and provide the output as an 
> RDD. Right now, one way to do that is with two map calls: the first being T 
> => U, followed by U => V. However, this will block on all T => U IO 
> operations. By wrapping U in a Future, this problem is avoided. The 
> "batchSize" is added because we do not want to create a future for every row 
> in a partition -- we may get too much data back at once. The batchSize limits 
> the number of outstanding Futures within a partition. Ideally this number is 
> set to be big enough so that there is always data ready to process, but small 
> enough that not too much data is pulled at any one time. We could potentially 
> default the batchSize to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
