[
https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079984#comment-14079984
]
Jeremy Freeman commented on SPARK-2012:
---------------------------------------
[~davies] cool, that definitely makes sense to me. Shall I put together a PR
that does it that way?
> PySpark StatCounter with numpy arrays
> -------------------------------------
>
> Key: SPARK-2012
> URL: https://issues.apache.org/jira/browse/SPARK-2012
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 1.0.0
> Reporter: Jeremy Freeman
> Priority: Minor
>
> In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy
> arrays just as with an RDD of scalars, which was very useful (e.g. for
> computing stats on a set of vectors in ML analyses). In 1.0.0 this broke
> because the added functionality for computing the minimum and maximum, as
> implemented, doesn't work on arrays.
> I have a PR ready that re-enables this functionality by having StatCounter
> use the numpy element-wise functions "maximum" and "minimum", which work on
> both numpy arrays and scalars (and I've added new tests for this capability).
> However, I realize this adds a dependency on NumPy outside of MLLib. If
> that's not ok, maybe it'd be worth adding this functionality as a util within
> PySpark MLLib?
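The element-wise behavior the description relies on can be illustrated like this (a minimal sketch of why builtin max/min fails on arrays while numpy's functions handle both cases; not the actual PR code):

```python
import numpy as np

a = np.array([1.0, 5.0, 3.0])
b = np.array([4.0, 2.0, 6.0])

# The builtin max()/min() can't compare two arrays: the comparison
# yields a boolean array whose truth value is ambiguous, so it raises.
try:
    max(a, b)
except ValueError:
    pass  # "The truth value of an array ... is ambiguous"

# np.maximum/np.minimum work element-wise on arrays...
print(np.maximum(a, b))  # [4. 5. 6.]
print(np.minimum(a, b))  # [1. 2. 3.]

# ...and also on plain scalars, so one code path covers both cases.
print(np.maximum(7, 3))  # 7
```

This is why swapping in `numpy.maximum`/`numpy.minimum` restores the 0.9 behavior for RDDs of arrays without breaking the scalar case.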
--
This message was sent by Atlassian JIRA
(v6.2#6252)