There is also this: https://github.com/soundcloud/cosine-lsh-join-spark
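
For the brute-force version the quoted question describes (load the table
over JDBC, XOR against the probe value, count bits), an untested PySpark
sketch along these lines might be a starting point. The JDBC URL, driver,
table, and column names are placeholders, and it still scans every row, so
it only buys parallelism, not a better algorithm:

from pyspark.sql import SQLContext
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

sqlContext = SQLContext(sc)  # `sc` is the SparkContext from the PySpark shell

# Load the MySQL table through the JDBC data source (placeholder settings);
# the partitioning options spread the read and the distance computation
# across the available cores.
df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://dbhost:3306/mydb?user=me&password=secret",
    dbtable="hashes",                # placeholder table with a BIGINT `hash` column
    driver="com.mysql.jdbc.Driver",
    partitionColumn="id",
    lowerBound="0",
    upperBound="30000000",
    numPartitions="16",
).load()

probe = 1311768467294899695          # the 64-bit value to match against
max_distance = 8

# Hamming distance = number of set bits in (hash XOR probe). bitwiseXOR is a
# built-in Column method; bit counting uses a small Python UDF and assumes
# non-negative hash values.
popcount = F.udf(lambda x: bin(x).count("1"), IntegerType())

matches = (df
           .withColumn("distance", popcount(df["hash"].bitwiseXOR(F.lit(probe))))
           .filter(F.col("distance") <= max_distance))
matches.show()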

On 02/11/2016 10:12 PM, Brian Morton wrote:
> Karl,
>
> This is tremendously useful.  Thanks very much for your insight.
>
> Brian
>
> On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley <kmhig...@gmail.com
> <mailto:kmhig...@gmail.com>> wrote:
>
>     Hi,
>
>     It sounds like you're trying to solve the approximate nearest
>     neighbor (ANN) problem. With a large dataset, parallelizing a
>     brute-force O(n^2) approach isn't likely to help all that much,
>     because the number of pairwise comparisons grows quadratically
>     with the size of the dataset -- with 30 million rows, that's on
>     the order of 4.5 * 10^14 pairs. I'd look at ways to avoid
>     computing the similarity between all pairs, like
>     locality-sensitive hashing.
>     (Unfortunately Spark doesn't yet support LSH -- it's currently
>     slated for the Spark 2.0.0 release, but AFAIK development on it
>     hasn't started yet.)
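>
>     For a rough idea of what LSH buys you, here's a minimal bit-sampling
>     sketch in plain Python (illustrative only -- not the planned Spark
>     implementation, and the table/hash sizes are arbitrary). Each table
>     hashes items by a random subset of bit positions, so only items that
>     collide in some table get compared exactly:
>
>     import random
>     from collections import defaultdict
>
>     def hamming(a, b):
>         return bin(a ^ b).count("1")
>
>     def build_tables(items, num_tables=8, bits_per_hash=16, num_bits=64):
>         """Index 64-bit integers into `num_tables` bit-sampling hash tables."""
>         tables = []
>         for _ in range(num_tables):
>             positions = random.sample(range(num_bits), bits_per_hash)
>             buckets = defaultdict(list)
>             for idx, value in enumerate(items):
>                 key = tuple((value >> p) & 1 for p in positions)
>                 buckets[key].append(idx)
>             tables.append((positions, buckets))
>         return tables
>
>     def query(tables, items, probe, max_distance):
>         """Check only bucket-mates of `probe`; recall depends on the parameters."""
>         candidates = set()
>         for positions, buckets in tables:
>             key = tuple((probe >> p) & 1 for p in positions)
>             candidates.update(buckets.get(key, []))
>         return [i for i in candidates if hamming(items[i], probe) <= max_distance]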
>
>     There are a bunch of Python libraries that support various
>     approaches to the ANN problem (including LSH), though. It sounds
>     like you need fast lookups, so you might check out
>     https://github.com/spotify/annoy. For other alternatives, see this
>     performance comparison of Python ANN
>     libraries: https://github.com/erikbern/ann-benchmarks.
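>
>     As a quick illustration of Annoy's API (the vector size, tree count,
>     and metric below are arbitrary example choices; check the docs for
>     the metrics your installed version supports):
>
>     import random
>     from annoy import AnnoyIndex
>
>     f = 64                            # dimensionality of each vector
>     index = AnnoyIndex(f, 'angular')  # example metric; others are available
>     for i in range(1000):
>         index.add_item(i, [random.random() for _ in range(f)])
>     index.build(10)                   # 10 trees: more trees, better recall
>     index.save('example.ann')         # saved indexes are memory-mapped on load
>
>     print(index.get_nns_by_item(0, 10))  # 10 approximate neighbours of item 0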
>
>     Hope that helps,
>     Karl
>
>     On Wed, Feb 10, 2016 at 10:29 PM rokclimb15 <rokclim...@gmail.com
>     <mailto:rokclim...@gmail.com>> wrote:
>
>         Hi everyone, I'm new to this list and to Spark, so I'm hoping
>         someone can point me in the right direction.
>
>         I'm trying to perform this same sort of task:
>         http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>
>         and I'm running into the same problem - it doesn't scale.
>         Even on a very fast processor, MySQL pegs out one CPU core at
>         100% and takes 8 hours to find a match with 30 million+ rows.
>
>         What I would like to do is to load this data set from MySQL
>         into Spark, compute the Hamming distance using all available
>         cores, and then select the rows within a maximum distance.
>         I'm most familiar with Python, so I would prefer to use that.
>
>         I found an example of loading data from MySQL:
>         http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>
>         I found a related DataFrame commit and docs, but I'm not
>         exactly sure how to put this all together.
>
>         https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>
>         http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>
>         Could anyone please point me to a similar example I could
>         follow as a Spark newb to try this out?  Is this even worth
>         attempting, or will it similarly fail performance-wise?
>
>         Thanks!
>

-- 
Maciej Szymkiewicz
