[ https://issues.apache.org/jira/browse/SPARK-6535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380631#comment-14380631 ]

Sean Owen commented on SPARK-6535:
----------------------------------

I don't follow. If you map T => U and then U => V, nothing happens at all until
an action is performed, and at that point it's not true that all of the U => V
operations wait on all of the T => U operations. Can you give a concrete
example? I doubt that the semantics of Future can be used here.

> new RDD function that returns intermediate Future
> -------------------------------------------------
>
>                 Key: SPARK-6535
>                 URL: https://issues.apache.org/jira/browse/SPARK-6535
>             Project: Spark
>          Issue Type: Wish
>          Components: Spark Core
>            Reporter: Eric Johnston
>            Priority: Minor
>              Labels: features, newbie
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I'm suggesting a possible Spark RDD method that I think could give value to a 
> number of people. I'd be interested in thoughts and feedback. Is this a good 
> or bad idea in general? Will it work well, but is too specific for Spark-Core?
>
> def mapIO[U, V : ClassTag](f1 : T => Future[U], f2 : U => V, batchSize : Int) : RDD[V]
>
> The idea is that we often have an RDD[T] containing metadata, for example a 
> file path or a unique identifier for data in an external database. We would 
> like to retrieve this data, process it, and provide the output as an RDD. 
> Right now, one way to do that is with two map calls: the first being T => U, 
> followed by U => V. However, this will block on all of the T => U IO 
> operations. Wrapping U in a Future avoids this problem. The "batchSize" 
> parameter is added because we do not want to create a Future for every row 
> in a partition -- we might get too much data back at once. The batchSize 
> limits the number of outstanding Futures within a partition. Ideally this 
> number is set big enough that there is always data ready to process, but 
> small enough that not too much data is pulled at any one time. We could 
> potentially default the batchSize to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
