[
https://issues.apache.org/jira/browse/SPARK-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547318#comment-14547318
]
Debasish Das commented on SPARK-4823:
-------------------------------------
I opened up a PR that worked well for our datasets. It is still brute-force
computation although we use blocked cartesian and user defined kernels to
optimize on cutting computation and shuffle...There are trivial ideas to go
from BLAS-1 to BLAS-2 and BLAS-3 as more sparse operations are added to mllib
BLAS although I don't think it will give us the runtime boost we are looking
for...
We are looking into approximate KNN family of algorithms to improve the runtime
further...KDTree is good for dense vector with low features but for sparse
vector in higher dimensions researches did not find it useful..
LSH seems to be most commonly used and that's the direction we are looking
into. I looked into papers but the one that showed good recall values in their
experiments as compared to brute force KNN is Google Correlate and that's the
validation strategy we will focus at
https://www.google.com/trends/correlate/nnsearch.pdf. Please point to any other
references that deem fit. There are twitter papers as well using LSH and the
implementation is available in algebird. We will start with algebird LSH but
ideally we don't want to have a distance metric hardcoded in LSH.
If we get good recall using LSH based method compared to the rowSimilarities
code from the PR, we will use LSH based method to approximate compute
similarities between dense/sparse rows using cosine kernel, dense userFactor,
productFactor from factorization using product kernel and dense user/product
factor similarities using cosine kernel.
The kernel abstraction is part of the current PR and right now we support
Cosine, Product, Euclidean and RBF. Pearson is something that's of interest but
it's not added yet. For approximate row similarity I will open up a separate
JIRA.
> rowSimilarities
> ---------------
>
> Key: SPARK-4823
> URL: https://issues.apache.org/jira/browse/SPARK-4823
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Reza Zadeh
>
> RowMatrix has a columnSimilarities method to find cosine similarities between
> columns.
> A rowSimilarities method would be useful to find similarities between rows.
> This is JIRA is to investigate which algorithms are suitable for such a
> method, better than brute-forcing it. Note that when there are many rows (>
> 10^6), it is unlikely that brute-force will be feasible, since the output
> will be of order 10^12.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]