Github user rezazadeh commented on a diff in the pull request:
https://github.com/apache/spark/pull/1778#discussion_r17818651
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
---
@@ -390,6 +393,113 @@ class RowMatrix(
new RowMatrix(AB, nRows, B.numCols)
}
+ /**
+ * Compute all cosine similarities between columns of this matrix using the brute-force
+ * approach of computing normalized dot products.
+ *
+ * @return An n x n sparse upper-triangular matrix of cosine similarities between
+ * columns of this matrix.
+ */
+ def columnSimilarities(): CoordinateMatrix = {
+ similarColumns(0.0)
+ }
+
+ /**
+ * Compute all similarities between columns of this matrix using a sampling approach.
+ *
+ * The threshold parameter is a trade-off knob between estimate quality and computational cost.
+ *
+ * Setting a threshold of 0 guarantees deterministic correct results, but comes at exactly
+ * the same cost as the brute-force approach. Setting the threshold to positive values
+ * incurs strictly less computational cost than the brute-force approach, however the
+ * similarities computed will be estimates.
+ *
+ * The sampling guarantees relative-error correctness for those pairs of columns that have
+ * similarity greater than the given similarity threshold.
+ *
+ * To describe the guarantee, we set some notation:
+ * Let A be the smallest in magnitude non-zero element of this matrix.
+ * Let B be the largest in magnitude non-zero element of this matrix.
+ * Let L be the number of non-zeros per row.
--- End diff --
Max. Added that.
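
For readers following the thread, here is a minimal sketch of the brute-force case described in the first doc comment above (normalized dot products between every pair of columns). It is plain Scala on an in-memory matrix, not the RowMatrix code from this pull request; the object name `ColumnSimilaritiesSketch`, the `Map` return type, and the choice to omit the diagonal and any all-zero columns are assumptions made for illustration only.

```scala
// Hypothetical standalone sketch; NOT the RowMatrix implementation from the PR.
object ColumnSimilaritiesSketch {

  /** Cosine similarity for every column pair (i, j) with i < j.
    * `rows` is a dense matrix given as a sequence of equally long rows.
    * Only the strict upper triangle is returned (an assumption of this sketch).
    */
  def columnSimilarities(rows: Seq[Array[Double]]): Map[(Int, Int), Double] = {
    require(rows.nonEmpty, "matrix must have at least one row")
    val n = rows.head.length
    require(rows.forall(_.length == n), "all rows must have the same length")

    // Euclidean norm of each column.
    val norms = Array.tabulate(n) { j =>
      math.sqrt(rows.map(r => r(j) * r(j)).sum)
    }

    val entries = for {
      i <- 0 until n
      j <- (i + 1) until n
      if norms(i) > 0.0 && norms(j) > 0.0 // skip all-zero columns
    } yield {
      // Dot product of columns i and j, normalized by both column norms.
      val dot = rows.map(r => r(i) * r(j)).sum
      ((i, j), dot / (norms(i) * norms(j)))
    }
    entries.toMap
  }
}
```

Calling `ColumnSimilaritiesSketch.columnSimilarities(Seq(Array(1.0, 2.0, 0.0), Array(2.0, 4.0, 0.0)))` yields a single entry for the pair (0, 1), with a similarity of 1 up to floating-point rounding because the two columns are parallel. The all-zero third column produces no entries. This corresponds only to the threshold-0 behaviour described in the diff; the sampling variant is not sketched here.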