I ran across DIMSUM a while ago but never used it.

https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html

Annoy is wonderful if you want to make queries.
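
For example, building and querying an Annoy index looks roughly like this
(just a sketch; the dimensionality, metric, and random vectors are
placeholders, not recommendations):

import random
from annoy import AnnoyIndex

dim = 64
index = AnnoyIndex(dim, 'angular')  # pick the metric that matches your similarity
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
index.build(10)  # 10 trees; more trees = better recall, bigger index

# query time: the 10 approximate nearest neighbours of item 0
print(index.get_nns_by_item(0, 10))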

If you want to do the "self similarity join", you might look at DIMSUM, or
preferably, if at all possible, see whether there's some key you can use to
join candidate pairs and then apply a similarity metric to filter out
non-matches. Does that make sense? In general that's far more efficient than
computing all n^2 similarities.
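
Something along these lines in PySpark, for instance (untested sketch; the
column names id/block_key/features and the Jaccard-style metric are made up,
so substitute whatever key and similarity fit your data):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()  # on Spark 1.x use a SQLContext instead
df = spark.read.parquet("items.parquet")    # assumed columns: id, block_key, features

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return float(len(sa & sb)) / len(sa | sb) if (sa or sb) else 0.0

similarity = F.udf(jaccard, DoubleType())

l, r = df.alias("l"), df.alias("r")
matches = (l.join(r, "block_key")                  # only pairs that share a key
           .where(F.col("l.id") < F.col("r.id"))   # drop self and mirror pairs
           .withColumn("sim", similarity(F.col("l.features"), F.col("r.features")))
           .where(F.col("sim") >= 0.8))            # keep likely matches only

matches.select("l.id", "r.id", "sim").show()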

Hth

Charlie
On Fri, Feb 12, 2016 at 20:57 Maciej Szymkiewicz <mszymkiew...@gmail.com>
wrote:

> There is also this: https://github.com/soundcloud/cosine-lsh-join-spark
>
>
> On 02/11/2016 10:12 PM, Brian Morton wrote:
>
> Karl,
>
> This is tremendously useful.  Thanks very much for your insight.
>
> Brian
>
> On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley <kmhig...@gmail.com> wrote:
>
>> Hi,
>>
>> It sounds like you're trying to solve the approximate nearest neighbor
>> (ANN) problem. With a large dataset, parallelizing a brute force O(n^2)
>> approach isn't likely to help all that much, because the number of pairwise
>> comparisons grows quickly as the size of the dataset increases. I'd look at
>> ways to avoid computing the similarity between all pairs, like
>> locality-sensitive hashing. (Unfortunately Spark doesn't yet support LSH --
>> it's currently slated for the Spark 2.0.0 release, but AFAIK development on
>> it hasn't started yet.)
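>>
>> Just to illustrate the bucketing idea for Hamming distance (a toy
>> single-machine sketch, not Spark's future LSH API): split each 64-bit key
>> into 4 chunks of 16 bits. Two keys within Hamming distance 3 must agree
>> exactly on at least one chunk, so you only need to compare keys that share
>> a chunk value and then verify each candidate pair exactly.
>>
>> from collections import defaultdict
>>
>> def chunks(key, n_chunks=4, bits=64):
>>     width = bits // n_chunks
>>     mask = (1 << width) - 1
>>     return [(i, (key >> (i * width)) & mask) for i in range(n_chunks)]
>>
>> def candidate_pairs(keys):
>>     buckets = defaultdict(list)
>>     for idx, key in enumerate(keys):
>>         for chunk in chunks(key):
>>             buckets[chunk].append(idx)
>>     pairs = set()
>>     for bucket in buckets.values():
>>         for i in range(len(bucket)):
>>             for j in range(i + 1, len(bucket)):
>>                 pairs.add((bucket[i], bucket[j]))
>>     return pairs  # verify each pair with an exact Hamming distance check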
>>
>> There are a bunch of Python libraries that support various approaches to
>> the ANN problem (including LSH), though. It sounds like you need fast
>> lookups, so you might check out https://github.com/spotify/annoy. For
>> other alternatives, see this performance comparison of Python ANN
>> libraries: https://github.com/erikbern/ann-benchmarks.
>>
>> Hope that helps,
>> Karl
>>
>> On Wed, Feb 10, 2016 at 10:29 PM rokclimb15 <rokclim...@gmail.com> wrote:
>>
>>> Hi everyone, I'm new to this list and to Spark, so I'm hoping someone can
>>> point me in the right direction.
>>>
>>> I'm trying to perform this same sort of task:
>>>
>>> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>>>
>>> and I'm running into the same problem - it doesn't scale.  Even on a very
>>> fast processor, MySQL pegs out one CPU core at 100% and takes 8 hours to
>>> find a match with 30 million+ rows.
>>>
>>> What I would like to do is load this data set from MySQL into Spark,
>>> compute the Hamming distance using all available cores, and then select
>>> the rows within a maximum distance. I'm most familiar with Python, so I'd
>>> prefer to use that.
>>>
>>> I found an example of loading data from MySQL:
>>>
>>> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>>>
>>> I found a related DataFrame commit and docs, but I'm not exactly sure how
>>> to put this all together:
>>>
>>> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>>>
>>> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
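>>>
>>> Roughly what I have in mind, pieced together from those docs (untested;
>>> the table and column names are made up):
>>>
>>> from pyspark.sql import SQLContext, functions as F
>>>
>>> sqlContext = SQLContext(sc)  # sc as provided by the pyspark shell
>>> df = (sqlContext.read.format("jdbc")
>>>       .option("url", "jdbc:mysql://localhost/mydb")
>>>       .option("dbtable", "hashes")      # columns: id, phash (BIGINT)
>>>       .option("user", "...")
>>>       .option("password", "...")
>>>       .load())
>>>
>>> target = 1234605616436508552  # the hash to compare against
>>> max_distance = 5
>>>
>>> # XOR each stored hash with the target, then count the 1 bits
>>> xor = df["phash"].bitwiseXOR(F.lit(target))
>>> hamming = F.length(F.regexp_replace(F.bin(xor), "0", ""))
>>>
>>> matches = df.withColumn("distance", hamming).where(F.col("distance") <= max_distance)
>>> matches.show()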
>>>
>>> Could anyone please point me to a similar example I could follow as a
>>> Spark newb to try this out? Is this even worth attempting, or will it
>>> similarly fail performance-wise?
>>>
>>> Thanks!
>>>
>>>
>>>
>
> --
> Maciej Szymkiewicz
>
>
