[ 
https://issues.apache.org/jira/browse/SPARK-18562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kuku1 closed SPARK-18562.
-------------------------
    Resolution: Not A Bug

I made a mistake: my RDD became corrupted at one point, which produced empty 
RDDs. 
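For anyone hitting the same error: the message comes from {{RowMatrix.numCols}} when the rows RDD is empty, so a defensive emptiness check before calling {{Statistics.corr}} surfaces the real problem. A minimal sketch (assumes a running SparkContext and the {{vectorRdd}} from the code in the issue below):

{code}
import org.apache.spark.mllib.stat.Statistics

// RDD.isEmpty() (available since Spark 1.3) avoids the opaque
// "Cannot determine the number of cols" RuntimeException.
if (vectorRdd.isEmpty()) {
  println("vectorRdd is empty after filtering - skipping correlation")
} else {
  val spearman = Statistics.corr(vectorRdd, method = "spearman")
  println("Spearman: " + spearman.toString())
}
{code}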

> Correlation causes Error “Cannot determine the number of cols”
> --------------------------------------------------------------
>
>                 Key: SPARK-18562
>                 URL: https://issues.apache.org/jira/browse/SPARK-18562
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.1
>         Environment: Ubuntu 14.04LTS
>            Reporter: Kuku1
>
> I followed the MLlib docs on how to calculate a correlation. I'm using Spark 1.6.1.
> First, my application filters out elements that do not contain all the values I'm 
> looking for. Then I map each remaining element to a dense Vector, as shown in 
> the docs, and pass the resulting RDD[Vector] to the MLlib function.
> My code is the following:
> {code}
> import org.apache.spark.mllib.linalg.{Vector, Vectors}
> import org.apache.spark.mllib.stat.Statistics
> import org.apache.spark.rdd.RDD
>
> val filteredRdd = rdd.filter(document => document.containsKey("SomeValue1")
>   && document.containsKey("SomeValue2") && document.containsKey("SomeValue3"))
> val vectorRdd: RDD[Vector] = filteredRdd.map(document =>
>   Vectors.dense(document.getDouble("SomeValue1"),
>     document.getDouble("SomeValue2"), document.getDouble("SomeValue3")))
> val correlation_matrix = Statistics.corr(vectorRdd, method = "spearman")
> println("Spearman: " + correlation_matrix.toString())
> val correlation_matrix_pearson = Statistics.corr(vectorRdd, method = "pearson")
> println("Pearson: " + correlation_matrix_pearson.toString())
> {code}
> This is the error that gets thrown:
> {code}
> 16/11/23 13:19:51 ERROR ApplicationMaster: User class threw exception: java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
> java.lang.RuntimeException: Cannot determine the number of cols because it is not specified in the constructor and the rows RDD is empty.
>     at scala.sys.package$.error(package.scala:27)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.numCols(RowMatrix.scala:64)
>     at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:328)
>     at org.apache.spark.mllib.stat.correlation.PearsonCorrelation$.computeCorrelationMatrix(PearsonCorrelation.scala:49)
>     at org.apache.spark.mllib.stat.correlation.SpearmanCorrelation$.computeCorrelationMatrix(SpearmanCorrelation.scala:91)
>     at org.apache.spark.mllib.stat.correlation.Correlations$.corrMatrix(Correlation.scala:66)
>     at org.apache.spark.mllib.stat.Statistics$.corr(Statistics.scala:74)
> {code}
> Because I filter out the elements that would produce an empty vector, I don't 
> see how this could be caused by my code, so I created this issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
