On Nov 3, 2016, at 10:26 PM, David Adams <[email protected]> wrote:
> 
>> It is slow. However, being able to utilize all 16 logical cores on the
>> dual-processor xserve takes it from being “too slow” to “ok with a
>> warning that this is slow”. Even still, it takes 30 seconds or so to
>> find possible duplicates in a set of 1000 names.
> 
> What algorithm(s) are you using and against what RDBMS? Several engines
> have some fuzzy comparators baked in. If you could move the fuzzy matches
> onto the server, it might work faster.

I’m using the fuzzy_match gem for ruby, which uses a combination of Pair 
Distance (2-gram) and Levenshtein Edit Distance. 

https://github.com/seamusabshere/fuzzy_match
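
For anyone curious what those two measures look like, here is a minimal
pure-Ruby sketch of pair distance (the Dice coefficient over 2-grams) and
Levenshtein edit distance. This is just the textbook form of each metric,
not the gem's actual code; the gem layers its own normalization and
weighting on top.

```ruby
# Break a string into overlapping 2-grams: "night" -> ["ni","ig","gh","ht"]
def bigrams(str)
  s = str.downcase
  (0..s.length - 2).map { |i| s[i, 2] }
end

# Pair distance (Dice coefficient): 2 * shared bigrams / total bigrams.
# 1.0 means identical bigram sets, 0.0 means nothing in common.
def pair_distance(a, b)
  ba = bigrams(a)
  bb = bigrams(b)
  return 0.0 if ba.empty? || bb.empty?
  shared = 0
  pool = bb.dup
  ba.each do |g|
    if (idx = pool.index(g))   # count each bigram match at most once
      shared += 1
      pool.delete_at(idx)
    end
  end
  2.0 * shared / (ba.size + bb.size)
end

# Classic dynamic-programming Levenshtein edit distance
# (minimum number of single-character inserts/deletes/substitutions).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

puts pair_distance("night", "nacht")      # => 0.25
puts levenshtein("kitten", "sitting")     # => 3
```

Running either metric once is cheap; the cost in my case comes from scoring
every candidate name against every record in the set.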

The database is PostgreSQL, which has a couple of options for fuzzy matching as 
well. However, the fuzzy_match gem provides exactly what I needed, while the 
PostgreSQL options would have been a lot more fiddly. With the fuzzy_match gem, 
finding possible duplicates for a single name is nearly instantaneous (about 
0.1s); it is only finding possible duplicates for hundreds of names that takes 
time (hundreds * nearly instantaneous does eventually add up). 

Jim Crate


> **********************************************************************
> 4D Internet Users Group (4D iNUG)
> FAQ:  http://lists.4d.com/faqnug.html
> Archive:  http://lists.4d.com/archives.html
> Options: http://lists.4d.com/mailman/options/4d_tech
> Unsub:  mailto:[email protected]
> **********************************************************************

