Bryan Cutler created SPARK-26573:
------------------------------------

             Summary: Python worker not reused with mapPartitions if not consuming iterator
                 Key: SPARK-26573
                 URL: https://issues.apache.org/jira/browse/SPARK-26573
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.0
            Reporter: Bryan Cutler
In PySpark, if the user calls RDD mapPartitions and does not fully consume the iterator, the Python worker reads the wrong signal and is not reused. Test to replicate:

{code:java}
def test_worker_reused_in_map_partition(self):

    def map_pid(iterator):
        # Passes when the iterator is consumed, e.g. by len(list(iterator)),
        # but fails as written because the iterator is never drained
        return (os.getpid() for _ in range(10))

    rdd = self.sc.parallelize([], 10)
    worker_pids_a = rdd.mapPartitions(map_pid).collect()
    worker_pids_b = rdd.mapPartitions(map_pid).collect()
    self.assertTrue(all([pid in worker_pids_a for pid in worker_pids_b]))
{code}

This is related to SPARK-26549, which fixes this issue but only for use in rdd.range.
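Until the fix lands, a workaround sketch (an assumption, not from this report) is to drain the partition iterator before yielding results, so the worker reaches the end-of-stream signal and can be reused. The helper name `map_pid_draining` and the drain loop are illustrative:

{code:java}
import os


def map_pid_draining(iterator):
    # Assumption: fully consuming the input iterator lets the Python
    # worker read the end-of-data signal correctly, allowing reuse.
    for _ in iterator:
        pass
    # Yield this worker's pid for each of 10 output records
    return (os.getpid() for _ in range(10))
{code}

Passing this function to rdd.mapPartitions in place of map_pid above should allow worker PIDs to repeat across jobs.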