[
https://issues.apache.org/jira/browse/SPARK-6535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381054#comment-14381054
]
Sean Owen commented on SPARK-6535:
----------------------------------
OK, so for some unit of input, you are able to (A) do some number crunching on
the input straight away while (B) waiting for a database result to come back.
That's something you can write into your own function, with a Future if you
like. I suppose you could also compute one RDD of the results of A and another
of the results of B and then join them, but that's probably significantly
slower and a bit more complex. Either way, I don't believe this needs new
semantics in Spark.
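The first approach (start the IO, crunch locally, then join the two results) can be sketched inside an ordinary map; fetch, crunch, and the 10-minute timeout are illustrative names and values, not from the issue:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.rdd.RDD

// Illustrative stand-ins; neither comes from the issue.
def fetch(key: String): Array[Byte] = ???        // blocking database read (B)
def crunch(key: String): Long = key.length.toLong // local number crunching (A)

def overlapped(rdd: RDD[String]): RDD[(Long, Array[Byte])] =
  rdd.map { key =>
    val io    = Future(fetch(key))            // kick off the database call
    val local = crunch(key)                   // crunch while it is in flight
    (local, Await.result(io, 10.minutes))     // join the two results
  }
```

Note this overlaps A and B per row, inside the existing map semantics, with no change to Spark itself.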
> new RDD function that returns intermediate Future
> -------------------------------------------------
>
> Key: SPARK-6535
> URL: https://issues.apache.org/jira/browse/SPARK-6535
> Project: Spark
> Issue Type: Wish
> Components: Spark Core
> Reporter: Eric Johnston
> Priority: Minor
> Labels: features, newbie
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> I'm suggesting a possible Spark RDD method that I think could give value to a
> number of people. I'd be interested in thoughts and feedback. Is this a good
> or bad idea in general? Will it work well, but is too specific for Spark-Core?
> def mapIO[U, V : ClassTag](f1 : T => Future[U], f2 : U => V, batchSize : Int)
> : RDD[V]
> The idea is that often we have an RDD[T] containing metadata, for
> example a file path or a unique identifier to data in an external database.
> We would like to retrieve this data, process it, and provide the output as an
> RDD. Right now, one way to do that is with two map calls: the first being T
> => U, followed by U => V. However, this will block on all T => U IO
> operations. By wrapping U in a Future, this problem is avoided. The
> "batchSize" is added because we do not want to create a future for every row
> in a partition -- we may get too much data back at once. The batchSize limits
> the number of outstanding Futures within a partition. Ideally this number is
> set to be big enough so that there is always data ready to process, but small
> enough that not too much data is pulled at any one time. We could potentially
> default the batchSize to 1.
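For context, the proposed semantics could be sketched today on top of mapPartitions; mapIO is not part of Spark, and the 10-minute timeout is an illustrative choice:

```scala
import scala.collection.mutable
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sketch only: keeps at most batchSize Futures outstanding per partition,
// so IO for later rows overlaps with processing of earlier ones.
def mapIO[T, U, V : ClassTag](rdd: RDD[T])
                             (f1: T => Future[U], f2: U => V,
                              batchSize: Int): RDD[V] =
  rdd.mapPartitions { iter =>
    val pending = mutable.Queue.empty[Future[U]]
    new Iterator[V] {
      private def fill(): Unit =
        while (pending.size < batchSize && iter.hasNext)
          pending.enqueue(f1(iter.next()))   // Futures start eagerly on enqueue
      def hasNext: Boolean = { fill(); pending.nonEmpty }
      def next(): V = f2(Await.result(pending.dequeue(), 10.minutes))
    }
  }
```

With batchSize = 1 this degenerates to the blocking two-map approach; larger values trade memory for more IO in flight.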
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)