[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242031#comment-14242031
 ] 

Debasish Das commented on SPARK-4823:
-------------------------------------

I am considering a baseline version that is very close to brute force, but 
cuts the computation off with a topK number: for each user, come up with the 
topK most similar users, where K is defined by the client. This will take care 
of the matrix factorization use case.

Basically, on the master we collect a set of user factors, broadcast it to 
every node, and do a reduceByKey to generate the topK users for each user in 
this user block. We send a kernel function (cosine / polynomial / RBF) into 
this calculation.
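
A minimal sketch of the block-wise topK idea above, in plain Python rather 
than Spark so that it runs standalone. The function names (local_top_k, 
merge_top_k, top_k_for_block) are hypothetical, the list of dicts stands in 
for the partitions of an RDD, and the merge loop plays the role of 
reduceByKey. It is only an illustration under those assumptions, not the 
proposed implementation.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def local_top_k(block, partition, k, kernel):
    """'Map' side: score each broadcast user against one partition's users."""
    out = {}
    for uid, vec in block.items():
        scored = [(kernel(vec, other), oid)
                  for oid, other in partition.items() if oid != uid]
        out[uid] = heapq.nlargest(k, scored)
    return out


def merge_top_k(a, b, k):
    """'Reduce' side: merge two candidate lists, keeping the k best."""
    return heapq.nlargest(k, a + b)


def top_k_for_block(block, partitions, k, kernel=cosine):
    """Top-k similar users for every user in the broadcast block."""
    merged = {}
    for part in partitions:  # each partition would run on its own node
        for uid, cands in local_top_k(block, part, k, kernel).items():
            merged[uid] = merge_top_k(merged.get(uid, []), cands, k)
    return {uid: [(oid, score) for score, oid in cands]
            for uid, cands in merged.items()}


# Toy data: four users with 2-dimensional factors, split over two partitions.
partitions = [
    {"a": [1.0, 0.0], "b": [0.9, 0.1]},
    {"c": [0.0, 1.0], "d": [-1.0, 0.0]},
]
block = {"a": [1.0, 0.0]}  # the user block that would be broadcast
result = top_k_for_block(block, partitions, k=2)
```

Each partition only ever emits k candidates per broadcast user, so the data 
shuffled per block is bounded by (block size) x k x (number of partitions), 
even though every pair is still scored.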

But this idea does not work for raw features, right? If we map the features to 
a lower dimension using factorization, then this approach should run fine, but 
I am not sure we can ask users to map their data into a lower dimension.

Is it possible to bring in ideas from Fastfood and Random Kitchen Sinks to do 
this?
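
For reference, a rough sketch of the Random Kitchen Sinks idea (random Fourier 
features) in plain Python. It maps raw features to a fixed number of random 
cosine features whose dot product approximates the RBF kernel 
exp(-gamma * ||x - y||^2). The names make_rff and rbf and the parameter values 
are assumptions made for illustration. Fastfood, which speeds up the random 
projection, is not shown, and whether this helps for rowSimilarities remains 
an open question.

```python
import math
import random


def make_rff(dim, num_features, gamma, seed=0):
    """Build a random Fourier feature map for exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * gamma)  # frequencies are drawn from N(0, 2*gamma)
    ws = [[rng.gauss(0.0, sigma) for _ in range(dim)]
          for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def transform(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]

    return transform


def rbf(x, y, gamma):
    """Exact RBF kernel, used here only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


gamma = 0.5
transform = make_rff(dim=3, num_features=4000, gamma=gamma, seed=42)
x = [1.0, 0.5, -0.3]
y = [0.8, 0.1, 0.2]
approx = dot(transform(x), transform(y))  # plain dot product of mapped rows
exact = rbf(x, y, gamma)
```

After the mapping, a kernel similarity becomes an ordinary dot product, so the 
same topK machinery that works on factor vectors could be reused on the mapped 
rows.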


> rowSimilarities
> ---------------
>
>                 Key: SPARK-4823
>                 URL: https://issues.apache.org/jira/browse/SPARK-4823
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Reza Zadeh
>
> RowMatrix has a columnSimilarities method to find cosine similarities between 
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This JIRA is to investigate which algorithms are suitable for such a 
> method, better than brute-forcing it. Note that when there are many rows (> 
> 10^6), it is unlikely that brute-force will be feasible, since the output 
> will be of order 10^12.



