Hi Liang,
The problem is that when I take a huge data set, I get a similarity matrix
of size 1616160 * 1616160, i.e. up to ~2.6 * 10^12 candidate entries before
thresholding.
Please find the code below:
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import org.apache.spark.storage.StorageLevel
import spark.implicits._ // in spark-shell; needed for toDF

// DIMSUM with threshold 0.5; returns a CoordinateMatrix of similar column pairs
val exact = mat.columnSimilarities(0.5)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }

case class Output(label1: Long, label2: Long, score: Double)
val fin = exactEntries.map { case ((i, j), score) => Output(i, j, score) }.toDF
val fin2 = fin.persist(StorageLevel.MEMORY_AND_DISK_SER)
Finally, when I try to write the data to Parquet with

fin2.write.parquet("/somelocation")

it takes forever and I do not see any progress. The same code works fine
with a smaller dataset. Any suggestion on how to deal with this situation
would be highly appreciated.
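
For what it's worth, one thing I was considering is forcing the computation
first and then reducing the number of output files before the write, along
these lines (untested sketch; the partition count of 200 is just a guess):

// Materialize the similarity result so the write step only measures
// the Parquet output cost, not the DIMSUM computation itself.
println(s"similar pairs: ${fin2.count()}")

// Write fewer, larger files instead of one tiny file per shuffle task.
fin2.repartition(200).write.parquet("/somelocation")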
Regards,
Satyajit.
On Sat, Dec 10, 2016 at 3:44 AM, Liang-Chi Hsieh <[email protected]> wrote:
> Hi Satyajit,
>
> I am not sure why you think DIMSUM cannot be applied to your use case,
> or whether you have tried it and run into problems.
>
> In the paper [1] the authors mention that they concentrate on the regime
> where the number of rows is very large and the number of columns is not
> too large, but I don't think that prevents you from applying it to a
> dataset with many columns. In another paper [2], they ran experiments on
> a dataset with 10^7 columns.
>
> Even if the number of columns is very large, DIMSUM should still work
> well as long as your dataset is very sparse and you use SparseVector.
> You can also adjust the threshold when using DIMSUM.
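>
> For example, a minimal sketch (the vectors here are just made-up
> illustrative data; in practice you would build them from your documents):
>
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>
> // Each sparse row stores only its non-zero entries, so memory scales
> // with the number of non-zeros, not with the number of columns.
> val rows = sc.parallelize(Seq(
>   Vectors.sparse(5, Array(0, 3), Array(1.0, 2.0)),
>   Vectors.sparse(5, Array(1, 4), Array(3.0, 4.0))
> ))
> val mat = new RowMatrix(rows)
>
> // A higher threshold samples more aggressively and prunes pairs below
> // it, trading some accuracy for a much cheaper computation.
> val approxSims = mat.columnSimilarities(0.8)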
>
>
> [1] Reza Bosagh Zadeh and Gunnar Carlsson, "Dimension Independent Matrix
> Square using MapReduce (DIMSUM)"
> [2] Reza Bosagh Zadeh and Ashish Goel, "Dimension Independent Similarity
> Computation"
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center