OK, I went and checked the DIMSUM implementation in Spark MLlib. The probability that a column is sampled is math.sqrt(10 * math.log(nCol) / threshold) / colMagnitude. The most influential parameter is colMagnitude: if in your dataset the colMagnitude of most columns is very low, then it looks like DIMSUM might not work much better than brute force, even if you set a higher threshold.
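To make the point concrete, here is a small sketch (not Spark's actual code) of that sampling probability, using the formula above; the function name and the example numbers are mine, chosen only for illustration:

```python
import math

def dimsum_sample_prob(n_cols, threshold, col_magnitude):
    # gamma as described above: 10 * log(nCol) / threshold
    gamma = 10 * math.log(n_cols) / threshold
    # sampling probability for a column; capped at 1.0,
    # i.e. the column is always sampled
    return min(1.0, math.sqrt(gamma) / col_magnitude)

# With nCol = 1000 and threshold = 0.5, sqrt(gamma) is about 11.75.
# A low-magnitude column is sampled with probability 1 (no savings):
print(dimsum_sample_prob(1000, 0.5, 2.0))    # 1.0
# Only columns with magnitude well above sqrt(gamma) get thinned out:
print(dimsum_sample_prob(1000, 0.5, 100.0))  # ~0.12
```

So when most column magnitudes sit below sqrt(gamma), nearly every column is sampled with probability 1 and the computation degenerates toward the brute-force all-pairs cost, which matches the concern above.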
-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Document-Similarity-Spark-Mllib-tp20196p20226.html