[ https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325007#comment-15325007 ]
Sean Owen commented on SPARK-15861:
-----------------------------------
Got it. That does look odd. I doubt the explanation is that "mapPartitions
works like map in this case", but I don't know enough Python to say what
would make the difference. What's an example of something that does work as
expected? Your snippet is cut off at the end of your patch description.
> pyspark mapPartitions with none generator functions / functors
> --------------------------------------------------------------
>
> Key: SPARK-15861
> URL: https://issues.apache.org/jira/browse/SPARK-15861
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.1
> Reporter: Greg Bowyer
> Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` behaves oddly when it
> is given a normal (non-generator) function.
> For instance, let's say we have the following:
> {code}
> import numpy as np
>
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
>     # Returns one array for the whole partition instead of yielding elements.
>     return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
> array([5, 6, 7, 8, 9]),
> array([10, 11, 12, 13, 14]),
> array([15, 16, 17, 18, 19]),
> array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservesPartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
> array([5, 6, 7, 8, 9]),
> array([10, 11, 12, 13, 14]),
> array([15, 16, 17, 18, 19]),
> array([20, 21, 22, 23, 24])]
> {code}
> This basically makes the provided function, which returns a value rather
> than yielding, act as if the end user had called {code}rdd.map{code}.
> I think that maybe a check should be put in that calls
> {code}inspect.isgeneratorfunction{code}?
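
The behaviour described above can be reproduced without Spark. The following is a minimal sketch, not PySpark's implementation: `simulate_map_partitions` is a hypothetical stand-in that calls the function once per partition and iterates over whatever it returns, plain lists stand in for the NumPy arrays, and the 2/3 split of rows across the two partitions is an assumption. It suggests that a function returning one container per partition has that container iterated element by element, which looks like `map`, whereas a generator that yields the container keeps it whole.

```python
import inspect


def simulate_map_partitions(partitions, f):
    """Hypothetical stand-in for rdd.mapPartitions(f).collect().

    Calls f once per partition and iterates over whatever f returns.
    """
    out = []
    for part in partitions:
        out.extend(f(iter(part)))
    return out


rows = [list(range(i, i + 5)) for i in range(0, 25, 5)]
partitions = [rows[:2], rows[2:]]  # assumed split across 2 partitions


def to_list(data):
    # Returns one container per partition, like to_np in the report.
    return list(data)


def to_list_gen(data):
    # Yields the container, so it stays a single element per partition.
    yield list(data)


flattened = simulate_map_partitions(partitions, to_list)
per_partition = simulate_map_partitions(partitions, to_list_gen)

print(len(flattened))      # one entry per row
print(len(per_partition))  # one entry per partition
print(inspect.isgeneratorfunction(to_list))
print(inspect.isgeneratorfunction(to_list_gen))
```

One caveat on the proposed check: `inspect.isgeneratorfunction` is false for any ordinary function that returns an iterator, such as one that returns `iter(...)` or a generator expression, so a strict check would probably reject functions that currently work.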