yuhao yang created SPARK-13639:
----------------------------------
Summary: Statistics.colStats(rdd).mean and variance should handle
NaN in the input vectors
Key: SPARK-13639
URL: https://issues.apache.org/jira/browse/SPARK-13639
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: yuhao yang
Priority: Trivial
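import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Run in spark-shell, where sc is the predefined SparkContext.
// The third vector carries a NaN in its first column.
val denseData = Array(
  Vectors.dense(3.8, 0.0, 1.8),
  Vectors.dense(1.7, 0.9, 0.0),
  Vectors.dense(Double.NaN, 0.0, 0.0)
)
val rdd = sc.parallelize(denseData)
// The single NaN makes the mean of the whole first column NaN.
println(Statistics.colStats(rdd).mean)
[NaN,0.3,0.6]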
This is a proposal for discussion on how colStats should handle NaN values in
the input vectors. We could either ignore NaN values in the computation, or
keep the current behavior and output NaN, which at least serves as a warning.
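
To make the first option concrete, here is a minimal sketch of a column mean that skips NaN entries. It is plain Scala over in-memory rows rather than an RDD, and NaNAwareColStats and meanIgnoringNaN are hypothetical names used only for illustration; nothing here is existing MLlib API or a proposed patch.

```scala
// Hypothetical helper for discussion only; not part of MLlib.
object NaNAwareColStats {

  /** Per-column mean over the non-NaN entries.
    * A column with no valid entries at all still yields NaN.
    */
  def meanIgnoringNaN(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val numCols = rows.head.length
    val sums = new Array[Double](numCols)   // running sum of valid entries
    val counts = new Array[Long](numCols)   // number of valid entries per column
    for (row <- rows) {
      require(row.length == numCols, "all rows must have the same length")
      var j = 0
      while (j < numCols) {
        if (!row(j).isNaN) {
          sums(j) += row(j)
          counts(j) += 1
        }
        j += 1
      }
    }
    Array.tabulate(numCols) { j =>
      if (counts(j) == 0) Double.NaN else sums(j) / counts(j)
    }
  }
}
```

On the three vectors from the example above, the first column would be averaged over its two valid entries (3.8 and 1.7) instead of becoming NaN, while the other two columns are unchanged. The cost of this option is that each column may end up with a different effective count, which would also affect variance.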