[ 
https://issues.apache.org/jira/browse/SPARK-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328245#comment-15328245
 ] 

Bryan Cutler commented on SPARK-15861:
--------------------------------------

[~gbow...@fastmail.co.uk]

{{mapPartitions}} expects a function that takes an iterator as input and 
outputs an iterable sequence, and your function in the example is actually 
providing this.  I think what is going on here is that your function maps the 
iterator to a numpy array, which internally will be something like 
{noformat}array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]){noformat} for the first 
partition; {{collect}} then iterates over that sequence and returns each 
element, which is itself a numpy array, so you get {noformat}array([0, 1, 2, 
3, 4]), array([5, 6, 7, 8, 9]){noformat} for the first two elements, and so on.

I believe this is working as intended. In general, {{mapPartitions}} will not 
give the same result as {{map}}, and it will fail outright if the function 
does not return a valid sequence.  The documentation could perhaps be a 
little clearer in that regard.
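To make the mechanism concrete, here is a minimal pure-Python sketch (no Spark or numpy needed; `map_partitions` below is a hypothetical stand-in for `rdd.mapPartitions(f).collect()`, and plain lists stand in for numpy arrays, which iterate row by row in the same way): Spark iterates over whatever the function returns and emits each element, so returning a 2-D array yields its rows, while yielding it from a generator keeps one element per partition.

```python
def map_partitions(partitions, f):
    # Hypothetical stand-in for rdd.mapPartitions(f).collect():
    # apply f to each partition's iterator, then iterate over the result
    # and emit each element.
    out = []
    for part in partitions:
        out.extend(f(iter(part)))
    return out

# Two partitions, each holding two 5-element rows.
partitions = [[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]],
              [[10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]]

def to_rows(data):
    # Like np.array(list(data)): the returned sequence iterates row by
    # row, so the collected output is one element per row.
    return list(data)

def to_rows_grouped(data):
    # Generator version: yields a single object per partition, so the
    # collected output is one element per partition instead.
    yield list(data)

print(map_partitions(partitions, to_rows))      # four rows
print(map_partitions(partitions, to_rows_grouped))  # two partition blocks
```

The second function shows the fix for the behavior reported below: wrap the per-partition result in a generator (or a one-element list) if you want one output element per partition.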

> pyspark mapPartitions with none generator functions / functors
> --------------------------------------------------------------
>
>                 Key: SPARK-15861
>                 URL: https://issues.apache.org/jira/browse/SPARK-15861
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1
>            Reporter: Greg Bowyer
>            Priority: Minor
>
> Hi all, it appears that the method `rdd.mapPartitions` does odd things if it 
> is fed a normal subroutine.
> For instance, let's say we have the following
> {code}
> import numpy as np
> rows = range(25)
> rows = [rows[i:i+5] for i in range(0, len(rows), 5)]
> rdd = sc.parallelize(rows, 2)
> def to_np(data):
>     return np.array(list(data))
> rdd.mapPartitions(to_np).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> rdd.mapPartitions(to_np, preservePartitioning=True).collect()
> ...
> [array([0, 1, 2, 3, 4]),
>  array([5, 6, 7, 8, 9]),
>  array([10, 11, 12, 13, 14]),
>  array([15, 16, 17, 18, 19]),
>  array([20, 21, 22, 23, 24])]
> {code}
> This basically makes a provided function that uses a plain return act as if 
> the end user had called {code}rdd.map{code}
> I think that maybe a check should be put in to call 
> {code}inspect.isgeneratorfunction{code}
> ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
