OK, I went and checked the DIMSUM implementation in Spark MLlib. The probability that a column is sampled is math.sqrt(10 * math.log(nCol) / threshold) / colMagnitude. The most influential parameter is colMagnitude: if in your dataset the colMagnitude of most columns is very low, then it looks like DIMSUM might not work much better than brute force, even if you set a higher threshold.
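To make the point concrete, here is a small sketch (not Spark's actual code) of that sampling probability, using the formula above; the function name and the example numbers are mine, chosen only for illustration:

```python
import math

def dimsum_sample_prob(n_cols, threshold, col_magnitude):
    # gamma as described above: 10 * log(nCol) / threshold
    gamma = 10 * math.log(n_cols) / threshold
    # sampling probability for a column; capped at 1.0,
    # i.e. the column is always sampled
    return min(1.0, math.sqrt(gamma) / col_magnitude)

# With nCol = 1000 and threshold = 0.5, sqrt(gamma) is about 11.75.
# A low-magnitude column is sampled with probability 1 (no savings):
print(dimsum_sample_prob(1000, 0.5, 2.0))    # 1.0
# Only columns with magnitude well above sqrt(gamma) get thinned out:
print(dimsum_sample_prob(1000, 0.5, 100.0))  # ~0.12
```

So when most column magnitudes sit below sqrt(gamma), nearly every column is sampled with probability 1 and the computation degenerates toward the brute-force all-pairs cost, which matches the concern above.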
-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Document-Similarity-Spark-Mllib-tp20196p20226.html