[ https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077056#comment-14077056 ]
Davies Liu commented on SPARK-2012:
-----------------------------------

Maybe we could try to use numpy.minimum and fall back to `min` if numpy is not available. That way it works without numpy, and also works for numpy arrays. @Jeremy Freeman, is that OK?

> PySpark StatCounter with numpy arrays
> -------------------------------------
>
>                 Key: SPARK-2012
>                 URL: https://issues.apache.org/jira/browse/SPARK-2012
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.0.0
>            Reporter: Jeremy Freeman
>            Priority: Minor
>
> In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy
> arrays just as with an RDD of scalars, which was very useful (e.g. for
> computing stats on a set of vectors in ML analyses). In 1.0.0 this broke
> because the added functionality for computing the minimum and maximum, as
> implemented, doesn't work on arrays.
>
> I have a PR ready that re-enables this functionality by having StatCounter
> use the numpy element-wise functions "maximum" and "minimum", which work on
> both numpy arrays and scalars (and I've added new tests for this capability).
> However, I realize this adds a dependency on NumPy outside of MLLib. If
> that's not ok, maybe it'd be worth adding this functionality as a util within
> PySpark MLLib?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
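A minimal sketch of the fallback being suggested (the class and attribute names here are illustrative, not the actual PySpark patch): bind numpy's element-wise `minimum`/`maximum` when numpy is importable, otherwise fall back to the builtins, which handle the scalar-only case.

```python
# Sketch of the proposed import-with-fallback approach.
try:
    import numpy
    # numpy.minimum/maximum are element-wise ufuncs: they work on
    # both scalars and numpy arrays.
    minimum, maximum = numpy.minimum, numpy.maximum
except ImportError:
    # Builtins only compare scalars, but that's all there is to
    # compare when numpy isn't installed.
    minimum, maximum = min, max


class StatCounter(object):
    """Toy stat counter tracking only min and max, to show the idea."""

    def __init__(self):
        self.minValue = float("inf")
        self.maxValue = float("-inf")

    def merge(self, value):
        # The same two calls handle scalars (either path) and
        # numpy arrays (numpy path), since the ufuncs broadcast.
        self.minValue = minimum(self.minValue, value)
        self.maxValue = maximum(self.maxValue, value)
        return self
```

With scalars, `StatCounter().merge(3).merge(1).merge(5)` ends with `minValue` of 1 and `maxValue` of 5 under either binding; with numpy installed, merging arrays yields element-wise mins and maxes instead.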