Greg Bowyer created SPARK-15861:
-----------------------------------

             Summary: pyspark mapPartitions with non-generator functions / functors
                 Key: SPARK-15861
                 URL: https://issues.apache.org/jira/browse/SPARK-15861
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.1
            Reporter: Greg Bowyer
            Priority: Minor
Hi all, it appears that the method `rdd.mapPartitions` does odd things if it is fed a plain (non-generator) function. For instance, let's say we have the following:

{code:python}
rows = range(25)
rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
rdd = sc.parallelize(rows)

def to_np(data):
    return np.array(list(data))

rdd.mapPartitions(to_np).collect()
...
[array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19]), array([20, 21, 22, 23, 24])]

rdd.mapPartitions(to_np, preservesPartitioning=True).collect()
...
[array([0, 1, 2, 3, 4]), array([5, 6, 7, 8, 9]), array([10, 11, 12, 13, 14]),
 array([15, 16, 17, 18, 19]), array([20, 21, 22, 23, 24])]
{code}

This effectively makes a provided function that returns (rather than yields) its results behave as if the end user had called {code}rdd.map{code}.

Perhaps a check using {code:python}inspect.isgeneratorfunction{code} should be added?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
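To illustrate the contract at issue outside Spark: `mapPartitions` expects the supplied function to return an *iterable of output elements* for each partition, and the results are chained together. A function that simply returns a sequence of the partition's elements therefore gets flattened element by element, which looks just like `map`. A minimal local sketch of this semantics (the names `map_partitions`, `as_single_value`, and `as_generator` are hypothetical, not PySpark API; NumPy is left out to keep it self-contained):

```python
import inspect
from itertools import chain

def map_partitions(func, partitions):
    # Minimal local model of mapPartitions: apply func to each
    # partition's iterator and chain the resulting iterables.
    return list(chain.from_iterable(func(iter(p)) for p in partitions))

partitions = [[0, 1, 2], [3, 4, 5]]

def as_single_value(data):
    # Plain function returning one sequence per partition -- since the
    # framework iterates whatever comes back, the sequence is flattened
    # element by element, i.e. map-like behavior.
    return list(data)

def as_generator(data):
    # The intended contract: yield output elements (here, one
    # list per partition).
    yield list(data)

print(map_partitions(as_single_value, partitions))  # [0, 1, 2, 3, 4, 5]
print(map_partitions(as_generator, partitions))     # [[0, 1, 2], [3, 4, 5]]

# The check suggested in this report would distinguish the two:
print(inspect.isgeneratorfunction(as_generator))     # True
print(inspect.isgeneratorfunction(as_single_value))  # False
```

Note that `inspect.isgeneratorfunction` alone would not catch every correct usage (a plain function legitimately returning an iterator or list of results is not a generator function), so such a check could at most warn rather than reject.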