[ 
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547318#comment-14547318
 ] 

Debasish Das commented on SPARK-4823:
-------------------------------------

I opened up a PR that worked well for our datasets. It is still brute-force 
computation although we use blocked cartesian and user defined kernels to 
optimize on cutting computation and shuffle...There are trivial ideas to go 
from BLAS-1 to BLAS-2 and BLAS-3 as more sparse operations are added to mllib 
BLAS although I don't think it will give us the runtime boost we are looking 
for...

We are looking into approximate KNN family of algorithms to improve the runtime 
further...KDTree is good for dense vector with low features but for sparse 
vector in higher dimensions researches did not find it useful..

LSH seems to be most commonly used and that's the direction we are looking 
into. I looked into papers but the one that showed good recall values in their 
experiments as compared to brute force KNN is Google Correlate and that's the 
validation strategy we will focus at 
https://www.google.com/trends/correlate/nnsearch.pdf. Please point to any other 
references that deem fit. There are twitter papers as well using LSH and the 
implementation is available in algebird. We will start with algebird LSH but 
ideally we don't want to have a distance metric hardcoded in LSH.

If we get good recall using LSH based method compared to the rowSimilarities 
code from the PR, we will use LSH based method to approximate compute 
similarities between dense/sparse rows using cosine kernel,  dense userFactor, 
productFactor from factorization using product kernel and dense user/product 
factor similarities using cosine kernel.

The kernel abstraction is part of the current PR and right now we support 
Cosine, Product, Euclidean and RBF. Pearson is something that's of interest but 
it's not added yet. For approximate row similarity I will open up a separate 
JIRA.

> rowSimilarities
> ---------------
>
>                 Key: SPARK-4823
>                 URL: https://issues.apache.org/jira/browse/SPARK-4823
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Reza Zadeh
>
> RowMatrix has a columnSimilarities method to find cosine similarities between 
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This is JIRA is to investigate which algorithms are suitable for such a 
> method, better than brute-forcing it. Note that when there are many rows (> 
> 10^6), it is unlikely that brute-force will be feasible, since the output 
> will be of order 10^12.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to