[ https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242031#comment-14242031 ]
Debasish Das commented on SPARK-4823:
-------------------------------------
I am considering a baseline version that is very close to brute force, but
where we cut the computation with a topK number: for each user, come up with
the topK most similar users, where K is defined by the client. This will take
care of the matrix factorization use case.

Basically, on master we collect a set of user factors (a user block),
broadcast it to every node, and do a reduceByKey to generate the topK users
for each user from this user block. We pass a kernel function (cosine /
polynomial / RBF) into this calculation.
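
A minimal single-machine sketch of the scheme described above, in plain Python
rather than Spark, may help make the data flow concrete. All names here
(`top_k_for_block`, `merge_top_k`, `all_top_k`, `block_size`) are hypothetical
and not part of MLlib; the loop over blocks stands in for the broadcast step,
and `merge_top_k` plays the role of the function handed to reduceByKey.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity kernel; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k_for_block(factors, block, kernel, k):
    """Score every user against one 'broadcast' block and keep the k best."""
    partial = {}
    for uid, u in factors.items():
        scored = ((kernel(u, v), vid) for vid, v in block.items() if vid != uid)
        partial[uid] = heapq.nlargest(k, scored)
    return partial


def merge_top_k(a, b, k):
    """Combine two partial top-k lists (stand-in for the reduceByKey function)."""
    return heapq.nlargest(k, a + b)


def all_top_k(factors, kernel, k, block_size):
    """Iterate over user blocks, merging partial results per user."""
    ids = sorted(factors)
    merged = {uid: [] for uid in ids}
    for start in range(0, len(ids), block_size):
        block = {i: factors[i] for i in ids[start:start + block_size]}
        for uid, cands in top_k_for_block(factors, block, kernel, k).items():
            merged[uid] = merge_top_k(merged[uid], cands, k)
    return merged


# Toy user factors (illustrative values only).
factors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
top = all_top_k(factors, cosine, k=2, block_size=2)
```

The work is still quadratic in the number of users; only the output is cut
down, to K entries per user, and a polynomial or RBF kernel can be swapped in
for `cosine` without changing the rest.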

But this idea does not work for raw features, right? If we map the features to
a lower dimension using factorization, then this approach should run fine, but
I am not sure we can ask users to map their data into a lower dimension.
Is it possible to bring in ideas from Fastfood and Random Kitchen Sinks to do
this?
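
For context, the Random Kitchen Sinks idea (random Fourier features) maps each
raw feature vector to a fixed-length vector whose plain dot products
approximate an RBF kernel. The sketch below is only an illustration under
assumptions: it targets the kernel exp(-gamma * ||x - y||^2), uses plain
Python, and the names `make_rff`, `num_features` and `gamma` are hypothetical.
It is not Fastfood, which replaces the dense random matrix with a faster
structured transform.

```python
import math
import random


def make_rff(dim, num_features, gamma, seed=0):
    """Build a random Fourier feature map for exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)  # std of the Gaussian frequency samples
    ws = [[rng.gauss(0.0, sigma) for _ in range(dim)]
          for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def transform(x):
        # One cosine feature per sampled (frequency, phase) pair.
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]

    return transform


def rbf(x, y, gamma):
    """Exact RBF kernel, used here only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


gamma = 0.5
transform = make_rff(dim=3, num_features=4000, gamma=gamma)
x = [1.0, 0.5, -0.2]
y = [0.8, 0.1, 0.3]
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
exact = rbf(x, y, gamma)
```

With a map like this, the kernel evaluation inside the topK computation becomes
an ordinary dot product on the transformed rows. The approximation error
shrinks roughly as one over the square root of `num_features`.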
> rowSimilarities
> ---------------
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Reza Zadeh
>
> RowMatrix has a columnSimilarities method to find cosine similarities between
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a
> method, better than brute-forcing it. Note that when there are many rows (>
> 10^6), it is unlikely that brute-force will be feasible, since the output
> will be of order 10^12.