[ https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077056#comment-14077056 ]
Davies Liu commented on SPARK-2012:
-----------------------------------

Maybe we could try to use numpy.minimum and fall back to `min` if numpy is not available. That way it works without numpy, and also works for numpy arrays. @Jeremy Freeman, is that OK?

> PySpark StatCounter with numpy arrays
> -------------------------------------
>
>                 Key: SPARK-2012
>                 URL: https://issues.apache.org/jira/browse/SPARK-2012
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.0.0
>            Reporter: Jeremy Freeman
>            Priority: Minor
>
> In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy
> arrays just as with an RDD of scalars, which was very useful (e.g. for
> computing stats on a set of vectors in ML analyses). In 1.0.0 this broke
> because the added functionality for computing the minimum and maximum, as
> implemented, doesn't work on arrays.
>
> I have a PR ready that re-enables this functionality by having StatCounter
> use the numpy element-wise functions "maximum" and "minimum", which work on
> both numpy arrays and scalars (and I've added new tests for this capability).
> However, I realize this adds a dependency on NumPy outside of MLLib. If
> that's not ok, maybe it'd be worth adding this functionality as a util within
> PySpark MLLib?

--
This message was sent by Atlassian JIRA
(v6.2#6252)
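A minimal sketch of the fallback being suggested (the class and attribute names here are illustrative, not the actual PySpark patch): bind numpy's element-wise `minimum`/`maximum` when numpy is importable, otherwise fall back to the builtins, which handle the scalar-only case.

```python
# Sketch of the proposed import-with-fallback approach.
try:
    import numpy
    # numpy.minimum/maximum are element-wise ufuncs: they work on
    # both scalars and numpy arrays.
    minimum, maximum = numpy.minimum, numpy.maximum
except ImportError:
    # Builtins only compare scalars, but that's all there is to
    # compare when numpy isn't installed.
    minimum, maximum = min, max


class StatCounter(object):
    """Toy stat counter tracking only min and max, to show the idea."""

    def __init__(self):
        self.minValue = float("inf")
        self.maxValue = float("-inf")

    def merge(self, value):
        # The same two calls handle scalars (either path) and
        # numpy arrays (numpy path), since the ufuncs broadcast.
        self.minValue = minimum(self.minValue, value)
        self.maxValue = maximum(self.maxValue, value)
        return self
```

With scalars, `StatCounter().merge(3).merge(1).merge(5)` ends with `minValue` of 1 and `maxValue` of 5 under either binding; with numpy installed, merging arrays yields element-wise mins and maxes instead.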