[ 
https://issues.apache.org/jira/browse/SPARK-13700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183428#comment-15183428
 ] 

Sean Owen commented on SPARK-13700:
-----------------------------------

mapPartitions gives you an iterator of elements; you return an iterator over 
mapped elements. I was _going_ to say you just make it a parallel iterator with 
{{.par}} and then map as usual, but that is not actually available on 
{{Iterator}}. Hm. I'm guessing there's some way to process an iterator in 
parallel that isn't too much code, but it isn't that one-liner. 
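
For illustration, here is a rough sketch of one way that might look, using only 
the Scala standard library: take the iterator in fixed-size batches and run each 
batch as a set of Futures. The names {{ParallelIterator}}, {{mapInBatches}} and 
{{batchSize}} are made up for this example and are not part of Spark.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelIterator {
  // Hypothetical helper (not a Spark API): applies `f` to the elements of `it`,
  // submitting up to `batchSize` calls at a time and preserving element order.
  // How many calls really run at once also depends on the ExecutionContext.
  def mapInBatches[A, B](it: Iterator[A], batchSize: Int)(f: A => B)
                        (implicit ec: ExecutionContext): Iterator[B] =
    it.grouped(batchSize).flatMap { batch =>
      // Start one Future per element of the batch.
      val futures = batch.map(a => Future(f(a)))
      // Block until the whole batch has finished, then emit its results.
      Await.result(Future.sequence(futures), Duration.Inf)
    }
}
```

Inside a job this would presumably be called from {{mapPartitions}}, passing the 
partition's iterator as {{it}}. Note that each batch waits for its slowest call 
before the next batch starts, so this is simple rather than optimal.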

Coming at it another way, I assume that you're bottlenecked on the database or 
other resource you're querying. You probably get about the same optimal 
behavior by repartitioning to match the number of concurrent connections that 
maxes out the resource, and then processing each partition serially over one 
connection. That assumes that repartitioning down to the right max isn't that 
expensive, but it need not involve a shuffle.
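
A minimal sketch of the per-partition half of that idea, under stated 
assumptions: {{Connection}}, {{lookup}} and {{processPartition}} are hypothetical 
names standing in for whatever client you actually use, and elements are taken 
to be strings only to keep the example short.

```scala
object OneConnectionPerPartition {
  // Hypothetical stand-in for a database or web client; not a Spark API.
  trait Connection extends AutoCloseable {
    def lookup(key: String): String
  }

  // Opens one connection, processes the partition's elements serially over it,
  // and closes it afterwards. The results are materialized before closing,
  // because the iterator returned by `map` is lazy and would otherwise call
  // `lookup` on a connection that is already closed.
  def processPartition(open: () => Connection)(it: Iterator[String]): Iterator[String] = {
    val conn = open()
    try it.map(conn.lookup).toVector.iterator
    finally conn.close()
  }
}
```

On the Spark side, {{coalesce(n)}} reduces the partition count without a shuffle 
by default, so the function above could be passed to {{mapPartitions}} after 
coalescing to the connection limit. Materializing the results holds a whole 
partition's output in memory, which may or may not be acceptable for your data.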

> Rdd.mapAsync(): Easily mix Spark and asynchronous transformation
> ----------------------------------------------------------------
>
>                 Key: SPARK-13700
>                 URL: https://issues.apache.org/jira/browse/SPARK-13700
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Paulo Costa
>            Priority: Minor
>              Labels: async, features, rdd, transform
>
> Spark is great for synchronous operations.
> But sometimes I need to call a database/web server/etc from my transform, and 
> the Spark pipeline stalls waiting for it.
> Avoiding that would be great!
> I suggest we add a new method RDD.mapAsync(), which can execute these 
> operations concurrently, avoiding the bottleneck.
> I've written a quick'n'dirty implementation of what I have in mind: 
> https://gist.github.com/paulo-raca/d121cf27905cfb1fafc3
> What do you think?
> If you agree with this feature, I can work on a pull request.


