Hi Liang,

The problem is that when I take a huge data set, I get a matrix of size 1616160 * 1616160.
Please find the code below:

    val exact = mat.columnSimilarities(0.5)
    val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }

    case class output(label1: Long, label2: Long, score: Double)

    val fin = exactEntries.map(x => output(x._1._1, x._1._2, x._2)).toDF
    val fin2 = fin.persist(StorageLevel.MEMORY_AND_DISK_SER)

Finally, when I try to write the data to Parquet from fin2 (fin2.write.parquet("/somelocation")), it takes forever and I see no progress. The same code works fine with a smaller dataset.

Any suggestion on how to deal with this situation is highly appreciated.

Regards,
Satyajit.

On Sat, Dec 10, 2016 at 3:44 AM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

> Hi Satyajit,
>
> I am not sure why you think DIMSUM cannot apply to your use case, or
> whether you have tried it and encountered some problems.
>
> Although in the paper [1] the authors mention that they concentrate on the
> regime where the number of rows is very large and the number of columns is
> not too large, I think that doesn't prevent you from applying it to a
> dataset with many columns. In another paper [2], they experimented with a
> dataset of 10^7 columns.
>
> Even if the number of columns is very large, DIMSUM should still work well
> if your dataset is very sparse and you use SparseVector. You can also
> adjust the threshold when using DIMSUM.
>
> [1] Reza Bosagh Zadeh and Gunnar Carlsson, "Dimension Independent Matrix
> Square using MapReduce (DIMSUM)"
> [2] Reza Bosagh Zadeh and Ashish Goel, "Dimension Independent Similarity
> Computation"
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Document-Similarity-Spark-Mllib-tp20196p20198.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
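For the archive, the advice above (sparse rows plus a DIMSUM threshold) can be sketched as a small end-to-end job. This is a minimal, hypothetical example, not the original job: the tiny input vectors, the output path, and the `repartition(200)` partition count are all placeholder assumptions you would tune for a real 1.6M-column matrix.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
import org.apache.spark.sql.SparkSession

object DimsumSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dimsum-sketch").getOrCreate()
    import spark.implicits._

    // Build rows as SparseVectors: (size, indices, values). For document
    // similarity, each row is a document and each column a term.
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.sparse(5, Array(0, 3), Array(1.0, 2.0)),
      Vectors.sparse(5, Array(1, 3), Array(3.0, 4.0)),
      Vectors.sparse(5, Array(0, 4), Array(5.0, 6.0))
    ))
    val mat = new RowMatrix(rows)

    // DIMSUM with a threshold: a higher value samples more aggressively and
    // drops column pairs whose cosine similarity is likely below it, which
    // shrinks the output CoordinateMatrix considerably.
    val similarities = mat.columnSimilarities(threshold = 0.5)

    // Flatten the entries into a DataFrame with named columns, then write
    // Parquet. Repartitioning first spreads the write across tasks instead
    // of funneling a huge similarity matrix through a few partitions.
    val df = similarities.entries
      .map { case MatrixEntry(i, j, s) => (i, j, s) }
      .toDF("label1", "label2", "score")

    df.repartition(200).write.mode("overwrite").parquet("/tmp/similarities")
    spark.stop()
  }
}
```

Since `columnSimilarities` only emits entries above the threshold for the upper triangle, the resulting DataFrame is far smaller than a dense 1616160 * 1616160 matrix would suggest, which is usually what makes the Parquet write feasible.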