yuhao yang created SPARK-13639:
----------------------------------

             Summary: Statistics.colStats(rdd).mean and variance should handle 
NaN in the input vectors
                 Key: SPARK-13639
                 URL: https://issues.apache.org/jira/browse/SPARK-13639
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: yuhao yang
            Priority: Trivial


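    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // Three 3-dimensional vectors; the last one has NaN in the first column.
    val denseData = Array(
      Vectors.dense(3.8, 0.0, 1.8),
      Vectors.dense(1.7, 0.9, 0.0),
      Vectors.dense(Double.NaN, 0.0, 0.0)
    )

    // sc is an existing SparkContext (for example, the one in spark-shell).
    val rdd = sc.parallelize(denseData)
    println(Statistics.colStats(rdd).mean)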

[NaN,0.3,0.6]

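This is just a proposal for discussion on how colStats should handle NaN 
values in the input vectors. We could either ignore NaN values when computing 
the mean and variance, or keep the current behavior and output NaN for the 
affected column, which serves as a warning that the input contains NaN.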
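
To make the "ignore NaN" option concrete, here is a minimal sketch in plain 
Scala, without Spark. It is not how MLlib computes colStats. The object 
NanAwareStats and the method colMeansIgnoringNaN are hypothetical names used 
only for illustration. The sketch assumes that each column's mean is taken 
over its non-NaN entries only, and that a column with no non-NaN entries 
still reports NaN.

```scala
// Hypothetical helper, not part of MLlib: column means that skip NaN entries.
object NanAwareStats {
  def colMeansIgnoringNaN(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.head.length
    val sums = new Array[Double](n)   // per-column sum of non-NaN values
    val counts = new Array[Long](n)   // per-column count of non-NaN values
    for (row <- rows) {
      require(row.length == n, "all rows must have the same length")
      var j = 0
      while (j < n) {
        val v = row(j)
        if (!v.isNaN) {
          sums(j) += v
          counts(j) += 1
        }
        j += 1
      }
    }
    // A column with no non-NaN entries has no defined mean, so report NaN.
    Array.tabulate(n) { j =>
      if (counts(j) == 0) Double.NaN else sums(j) / counts(j)
    }
  }
}
```

With the three vectors from the example above, the first column would be 
averaged over its two non-NaN values (3.8 and 1.7), giving about 2.75 instead 
of NaN, while the other two columns stay at about 0.3 and 0.6. Note that this 
gives each column its own effective count, which is one of the design 
questions the proposal would need to settle.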





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
