On Nov 3, 2016, at 10:26 PM, David Adams <[email protected]> wrote:
>
>> It is slow. However, being able to utilize all 16 logical cores on the
>> dual-processor Xserve takes it from being “too slow” to “ok with a
>> warning that this is slow”. Even still, it takes 30 seconds or so to
>> find possible duplicates in a set of 1000 names.
>
> What algorithm(s) are you using, and against what RDBMS? Several engines
> have some fuzzy comparators baked in. If you could move the fuzzy matches
> onto the server, it might work faster.
I’m using the fuzzy_match gem for Ruby, which uses a combination of pair distance (2-gram) and Levenshtein edit distance:

https://github.com/seamusabshere/fuzzy_match

The database is PostgreSQL, which has a couple of options for fuzzy matching as well (the pg_trgm and fuzzystrmatch extensions). However, the fuzzy_match gem provides exactly what I needed, while the PostgreSQL options would have been a lot more fiddly.

With the fuzzy_match gem, finding possible duplicates for a single name is nearly instantaneous (0.1s); it is only finding possible duplicates for hundreds of names that takes time (hundreds × nearly instantaneous does eventually add up).

Jim Crate

**********************************************************************
4D Internet Users Group (4D iNUG)
FAQ: http://lists.4d.com/faqnug.html
Archive: http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub: mailto:[email protected]
**********************************************************************
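[Editor's note] For readers unfamiliar with the two metrics mentioned above, here is a minimal pure-Ruby sketch of pair (2-gram) distance and Levenshtein edit distance and how they might be combined to rank candidates. The helper names are hypothetical; the fuzzy_match gem's actual internals differ.

```ruby
# Sketch of the two metrics the fuzzy_match gem combines.
# NOTE: illustrative only; not the gem's real implementation.

# Break a string into overlapping character bigrams (2-grams).
def bigrams(str)
  str.downcase.chars.each_cons(2).map(&:join)
end

# Dice coefficient over bigrams: 1.0 = identical, 0.0 = no shared pairs.
def pair_similarity(a, b)
  ga, gb = bigrams(a), bigrams(b)
  return 0.0 if ga.empty? || gb.empty?
  2.0 * (ga & gb).size / (ga.size + gb.size)
end

# Classic dynamic-programming Levenshtein edit distance.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Rank candidates: highest bigram similarity first,
# fewest edits as the tie-breaker.
def best_matches(needle, haystack, limit = 3)
  haystack.sort_by { |name|
    [-pair_similarity(needle, name), levenshtein(needle, name)]
  }.first(limit)
end

puts best_matches('Jon Smyth', ['John Smith', 'Jane Smith', 'David Adams']).inspect
```

This also shows why hundreds of names add up: checking every name against every other is an all-pairs scan, so 1000 names means on the order of half a million comparisons, which is why spreading the work across cores helps.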

