Hi Liang,

The problem is that when I take a huge data set, I get a matrix of size 1616160 * 1616160.
Please find the code below:

    val exact = mat.columnSimilarities(0.5)
    val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }

    case class output(label1: Long, label2: Long, score: Double)

    val fin = exactEntries.map(x => output(x._1._1, x._1._2, x._2)).toDF
    val fin2 = fin.persist(StorageLevel.MEMORY_AND_DISK_SER)

Finally, when I try to write the data to Parquet from fin2 (fin2.write.parquet("/somelocation")), it takes forever and I see no progress. The same code works fine with a smaller dataset.

Any suggestion on how to deal with this situation is highly appreciated.

Regards,
Satyajit.

On Sat, Dec 10, 2016 at 3:44 AM, Liang-Chi Hsieh <vii...@gmail.com> wrote:

> Hi Satyajit,
>
> I am not sure why you think DIMSUM cannot apply to your use case, or
> whether you have tried it and encountered some problems.
>
> Although in the paper [1] the authors mention that they concentrate on the
> regime where the number of rows is very large and the number of columns is
> not too large, I think that doesn't prevent you from applying it to a
> dataset with many columns. In another paper [2], they experimented with a
> dataset of 10^7 columns.
>
> Even if the number of columns is very large, DIMSUM should still work well
> if your dataset is very sparse and you use SparseVector. You can also
> adjust the threshold when using DIMSUM.
>
> [1] Reza Bosagh Zadeh and Gunnar Carlsson, "Dimension Independent Matrix
> Square using MapReduce (DIMSUM)"
> [2] Reza Bosagh Zadeh and Ashish Goel, "Dimension Independent Similarity
> Computation"
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Document-Similarity-Spark-Mllib-tp20196p20198.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
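For the archive, the advice above (sparse rows plus a DIMSUM threshold) can be sketched as a small end-to-end job. This is a minimal, hypothetical example, not the original job: the tiny input vectors, the output path, and the `repartition(200)` partition count are all placeholder assumptions you would tune for a real 1.6M-column matrix.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
import org.apache.spark.sql.SparkSession

object DimsumSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dimsum-sketch").getOrCreate()
    import spark.implicits._

    // Build rows as SparseVectors: (size, indices, values). For document
    // similarity, each row is a document and each column a term.
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.sparse(5, Array(0, 3), Array(1.0, 2.0)),
      Vectors.sparse(5, Array(1, 3), Array(3.0, 4.0)),
      Vectors.sparse(5, Array(0, 4), Array(5.0, 6.0))
    ))
    val mat = new RowMatrix(rows)

    // DIMSUM with a threshold: a higher value samples more aggressively and
    // drops column pairs whose cosine similarity is likely below it, which
    // shrinks the output CoordinateMatrix considerably.
    val similarities = mat.columnSimilarities(threshold = 0.5)

    // Flatten the entries into a DataFrame with named columns, then write
    // Parquet. Repartitioning first spreads the write across tasks instead
    // of funneling a huge similarity matrix through a few partitions.
    val df = similarities.entries
      .map { case MatrixEntry(i, j, s) => (i, j, s) }
      .toDF("label1", "label2", "score")

    df.repartition(200).write.mode("overwrite").parquet("/tmp/similarities")
    spark.stop()
  }
}
```

Since `columnSimilarities` only emits entries above the threshold for the upper triangle, the resulting DataFrame is far smaller than a dense 1616160 * 1616160 matrix would suggest, which is usually what makes the Parquet write feasible.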