There is also this: https://github.com/soundcloud/cosine-lsh-join-spark
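
For Hamming distance specifically, the classic LSH family is bit sampling: key each signature by its values at a handful of randomly chosen bit positions, build several such tables, and compute exact distances only for rows that collide with the query in at least one of them. Below is a rough PySpark sketch of that idea, not of any particular library's API; the 64-bit signature width, the table parameters, and all names are illustrative assumptions.

    import random

    from pyspark import SparkContext

    sc = SparkContext(appName="bit-sampling-lsh")

    SIG_BITS = 64       # assumed signature width
    NUM_TABLES = 8      # more tables -> higher recall, more candidates
    BITS_PER_KEY = 16   # more bits per key -> smaller, more selective buckets

    random.seed(42)
    TABLES = [random.sample(range(SIG_BITS), BITS_PER_KEY)
              for _ in range(NUM_TABLES)]

    def bucket_keys(sig):
        # One key per table: the signature's bits at that table's positions.
        return [(t, tuple((sig >> b) & 1 for b in bits))
                for t, bits in enumerate(TABLES)]

    # Toy data standing in for the real table: (id, 64-bit signature).
    rows = sc.parallelize([(i, random.getrandbits(SIG_BITS))
                           for i in range(100000)])

    query = random.getrandbits(SIG_BITS)
    query_keys = set(bucket_keys(query))

    # Only rows sharing at least one bucket with the query pay for an
    # exact Hamming distance (popcount of the XOR).
    max_dist = 4
    matches = (rows
               .flatMap(lambda r: [(k, r) for k in bucket_keys(r[1])])
               .filter(lambda kr: kr[0] in query_keys)
               .map(lambda kr: kr[1])
               .distinct()
               .filter(lambda r: bin(r[1] ^ query).count("1") <= max_dist))

    print(matches.collect())

Tuning is a recall/cost trade-off: more tables catch more of the true near neighbors, while more bits per key makes buckets smaller and the candidate set cheaper to check exactly.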
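And for the DataFrame route the question below asks about, a minimal sketch under similar assumptions (Spark 1.6-era Python API, the MySQL JDBC driver on the classpath, placeholder connection settings, and a hypothetical signatures(id BIGINT, sig BIGINT) table): read the table with a partitioned JDBC scan, XOR the signature column against the query with Column.bitwiseXOR, popcount the result in a UDF, and filter on the maximum distance.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    sc = SparkContext(appName="hamming-scan")
    sqlContext = SQLContext(sc)

    # Partitioned JDBC read so every core gets a slice of the table
    # (placeholder connection settings; assumed schema (id BIGINT, sig BIGINT)).
    df = (sqlContext.read.format("jdbc")
          .options(url="jdbc:mysql://localhost:3306/mydb",
                   dbtable="signatures",
                   user="spark", password="...",
                   partitionColumn="id", lowerBound="0",
                   upperBound="30000000", numPartitions="64")
          .load())

    # Hamming distance = popcount of the XOR. Mask to 64 bits in case
    # the driver hands back negative values for the BIGINT column.
    popcount = F.udf(
        lambda x: bin(x & 0xFFFFFFFFFFFFFFFF).count("1")
                  if x is not None else None,
        IntegerType())

    query_sig = 0x1234ABCD5678EF90   # hypothetical query signature
    max_dist = 4

    matches = (df
               .withColumn("dist",
                           popcount(F.col("sig").bitwiseXOR(F.lit(query_sig))))
               .where(F.col("dist") <= max_dist))
    matches.show()

This spreads the scan across all available cores, but it is still a full O(n) pass per lookup, which is why Karl's advice below about avoiding comparisons against every row matters at 30 million+ rows.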
On 02/11/2016 10:12 PM, Brian Morton wrote:
> Karl,
>
> This is tremendously useful. Thanks very much for your insight.
>
> Brian
>
> On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley <kmhig...@gmail.com
> <mailto:kmhig...@gmail.com>> wrote:
>
>     Hi,
>
>     It sounds like you're trying to solve the approximate nearest
>     neighbor (ANN) problem. With a large dataset, parallelizing a
>     brute force O(n^2) approach isn't likely to help all that much,
>     because the number of pairwise comparisons grows quickly as the
>     size of the dataset increases. I'd look at ways to avoid computing
>     the similarity between all pairs, like locality-sensitive hashing.
>     (Unfortunately Spark doesn't yet support LSH -- it's currently
>     slated for the Spark 2.0.0 release, but AFAIK development on it
>     hasn't started yet.)
>
>     There are a bunch of Python libraries that support various
>     approaches to the ANN problem (including LSH), though. It sounds
>     like you need fast lookups, so you might check out
>     https://github.com/spotify/annoy. For other alternatives, see this
>     performance comparison of Python ANN libraries:
>     https://github.com/erikbern/ann-benchmarks.
>
>     Hope that helps,
>     Karl
>
>     On Wed, Feb 10, 2016 at 10:29 PM rokclimb15 <rokclim...@gmail.com
>     <mailto:rokclim...@gmail.com>> wrote:
>
>         Hi everyone, new to this list and Spark, so I'm hoping someone
>         can point me in the right direction.
>
>         I'm trying to perform this same sort of task:
>         http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>
>         and I'm running into the same problem - it doesn't scale. Even
>         on a very fast processor, MySQL pegs out one CPU core at 100%
>         and takes 8 hours to find a match with 30 million+ rows.
>
>         What I would like to do is to load this data set from MySQL
>         into Spark and compute the Hamming distance using all
>         available cores, then select the rows matching a maximum
>         distance. I'm most familiar with Python, so would prefer to
>         use that.
>
>         I found an example of loading data from MySQL:
>         http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>
>         I found a related DataFrame commit and docs, but I'm not
>         exactly sure how to put this all together:
>         https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>         http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>
>         Could anyone please point me to a similar example I could
>         follow as a Spark newb to try this out? Is this even worth
>         attempting, or will it similarly fail performance-wise?
>
>         Thanks!
>
>         --
>         View this message in context:
>         http://apache-spark-user-list.1001560.n3.nabble.com/Computing-hamming-distance-over-large-data-set-tp26202.html
>         Sent from the Apache Spark User List mailing list archive at
>         Nabble.com.
>
>         ---------------------------------------------------------------------
>         To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>         <mailto:user-unsubscr...@spark.apache.org>
>         For additional commands, e-mail: user-h...@spark.apache.org
>         <mailto:user-h...@spark.apache.org>

--
Maciej Szymkiewicz