[ 
https://issues.apache.org/jira/browse/SPARK-20347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970933#comment-15970933
 ] 

Maciej Szymkiewicz commented on SPARK-20347:
--------------------------------------------

This is a nice idea, but I wonder what the benefit would be over using 
{{concurrent.futures}}? These work just fine with PySpark (at least in simple 
cases), and the only overhead is creating the executor. It is not as if we have 
to worry about the GIL here.

{code}
from pyspark.rdd import RDD
from pyspark import SparkConf, SparkContext
from concurrent.futures import ThreadPoolExecutor


def _get_executor():
    # Lazily attach a shared ThreadPoolExecutor to the active SparkContext,
    # sized by spark.driver.cores (defaulting to 1 if unset).
    sc = SparkContext.getOrCreate()
    if not hasattr(sc, "_thread_pool_executor"):
        max_workers = int(sc.getConf().get("spark.driver.cores", "1"))
        sc._thread_pool_executor = ThreadPoolExecutor(max_workers=max_workers)
    return sc._thread_pool_executor


def asyncCount(self):
    # Returns a concurrent.futures.Future wrapping the blocking count().
    return _get_executor().submit(self.count)


def foreachAsync(self, f):
    # Returns a Future wrapping the blocking foreach(f).
    return _get_executor().submit(self.foreach, f)


RDD.asyncCount = asyncCount
RDD.foreachAsync = foreachAsync
{code}
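To illustrate how the returned futures would be consumed, here is a minimal sketch of the same pattern without Spark, with a plain function ({{blocking_count}}, a hypothetical stand-in for a blocking RDD action such as {{count()}}) submitted to a {{ThreadPoolExecutor}}:

{code}
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a blocking RDD action such as count(); the Future
# pattern is the same regardless of what the submitted callable does.
def blocking_count():
    return sum(1 for _ in range(1000))

executor = ThreadPoolExecutor(max_workers=1)

# submit() returns immediately with a Future, like asyncCount() above.
future = executor.submit(blocking_count)

# Optionally attach a completion callback instead of blocking.
future.add_done_callback(lambda f: print("count finished:", f.result()))

# result() blocks until the action completes, then returns its value.
print(future.result())  # 1000

executor.shutdown()
{code}

Callers can block with {{result()}}, poll with {{done()}}, or register callbacks, which covers what {{AsyncRDDActions}} offers on the Scala side.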

> Provide AsyncRDDActions in Python
> ---------------------------------
>
>                 Key: SPARK-20347
>                 URL: https://issues.apache.org/jira/browse/SPARK-20347
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.2.0
>            Reporter: holdenk
>            Priority: Minor
>
> In core Spark, AsyncRDDActions allows people to perform non-blocking RDD 
> actions. In Python, where threading is a bit more involved, there could be 
> value in exposing this; the easiest way might involve using the Py4J callback 
> server on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
