Bryan Cutler created SPARK-26573:
------------------------------------

             Summary: Python worker not reused with mapPartitions if not 
consuming iterator
                 Key: SPARK-26573
                 URL: https://issues.apache.org/jira/browse/SPARK-26573
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.0
            Reporter: Bryan Cutler


In PySpark, if the user calls RDD.mapPartitions and does not consume the 
iterator, the Python worker reads the wrong signal and is not reused. A test 
to replicate:
{code:python}
def test_worker_reused_in_map_partition(self):

    def map_pid(iterator):
        # Fails when iterator not consumed, e.g. len(list(iterator))
        return (os.getpid() for _ in xrange(10))

    rdd = self.sc.parallelize([], 10)

    worker_pids_a = rdd.mapPartitions(map_pid).collect()
    worker_pids_b = rdd.mapPartitions(map_pid).collect()

    self.assertTrue(all([pid in worker_pids_a for pid in worker_pids_b])){code}

This is related to SPARK-26549, which fixes this issue but only for use in 
rdd.range.
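Until a general fix lands, a minimal workaround sketch (not part of the report; the function name is illustrative) is to drain the partition iterator before returning, so the worker reaches end-of-stream and reads the reuse signal correctly:

{code:python}
import os


def map_pid(iterator):
    # Workaround: fully consume the partition iterator, even if its
    # elements are not needed, so the Python worker can be reused.
    for _ in iterator:
        pass
    # Return the results only after the input iterator is exhausted.
    return (os.getpid() for _ in range(10))
{code}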



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
