Kuku1 created SPARK-18562:
-----------------------------
Summary: Correlation causes Error "Cannot determine the number of cols"
Key: SPARK-18562
URL: https://issues.apache.org/jira/browse/SPARK-18562
Project: Spark
Issue Type: Bug
Affects Versions: 1.6.1
Environment: Ubuntu 14.04 LTS
Reporter: Kuku1
I followed the MLlib docs on how to calculate a correlation. I'm using Spark
1.6.1.
First, my application filters out elements that do not have all the values I'm
looking for. Afterwards, I map each remaining element to a dense Vector, as
shown in the docs, and pass the resulting RDD[Vector] to the MLlib function.
My code is the following:
{code}
val filteredRdd = rdd.filter(document =>
  document.containsKey("SomeValue1") &&
  document.containsKey("SomeValue2") &&
  document.containsKey("SomeValue3"))

val vectorRdd: RDD[Vector] = filteredRdd.map(document =>
  Vectors.dense(
    document.getDouble("SomeValue1"),
    document.getDouble("SomeValue2"),
    document.getDouble("SomeValue3")))

val correlation_matrix = Statistics.corr(vectorRdd, method = "spearman")
println("Spearman: " + correlation_matrix.toString())

val correlation_matrix_pearson = Statistics.corr(vectorRdd, method = "pearson")
println("Pearson: " + correlation_matrix_pearson.toString())
{code}
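As a sanity check on my side (a debugging sketch I added while investigating, using the `vectorRdd` built above), I can force evaluation of the RDD and confirm it is non-empty before calling `Statistics.corr`. Because the filter and map are lazy, `count()` is where any problem producing an empty RDD would first surface:

```scala
// Debugging sketch: verify the filtered/mapped RDD actually contains rows
// before handing it to Statistics.corr. count() forces full evaluation of
// the lazy filter and map stages.
val n = vectorRdd.count()
println(s"vectorRdd contains $n vectors")
if (n == 0) {
  sys.error("vectorRdd is empty -- the filter removed every document")
}
```

(This requires a running Spark context, so it is meant to be dropped into the application above, not run standalone.)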
This is the error that gets thrown:
{code}
16/11/23 13:19:51 ERROR ApplicationMaster: User class threw exception: java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
	at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:328)
	at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
	at org.apache.spark.mllib.stat.correlation.SpearmanCorrelation$.computeCorrelationMatrix(SpearmanCorrelation.scala:91)
	at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
	at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
{code}
Because I filter out the elements that would produce an incomplete vector, I
don't see how my code could be handing MLlib an empty RDD, so I'm filing this
issue.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)