Fuzzy matching

James Pettyjohn Thu, 28 Apr 2011 23:50:43 -0700


Hey,


First time writing in. 

I am in the middle of setting up a small
test cluster (using Cloudera's CDH3 on Fedora 14), and I am trying to
work out a good model for a use case that just came up.

I have around 6
million active records in a contacts database, plus additional millions of
historical address records for those contacts. I have just received 60,000+ new
records that are not yet correlated with these, and I need to fuzzy match them
against both the active and the historical records.

That is where it starts, but I will also need to do the
same thing with the database against itself for de-duplication. The data is
primarily in Oracle (with the supplement in CSV files).
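
To make the use case concrete, here is a rough sketch of the kind of batch
match I have in mind. It assumes a blocking key, so that each new record is
only compared against existing records sharing that key instead of against all
6 million, and a plain string-similarity score from Python's difflib. The field
names, the blocking key (postal code plus first letter of the last name), and
the 0.85 threshold are all illustrative assumptions, not a settled design.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def blocking_key(record):
    """Hypothetical blocking key: postal code plus first letter of last name."""
    last = normalize(record["last_name"])
    return (record["zip"].strip(), last[:1])


def as_text(record):
    """Flatten the fields being compared into one normalized string."""
    return normalize(
        record["first_name"] + " " + record["last_name"] + " " + record["street"]
    )


def similarity(a, b):
    """Similarity in [0, 1] based on difflib's SequenceMatcher ratio."""
    return SequenceMatcher(None, as_text(a), as_text(b)).ratio()


def fuzzy_match(new_records, existing_records, threshold=0.85):
    """Return (new, existing, score) triples scoring at or above threshold."""
    # Group existing records by blocking key so only plausible candidates
    # are compared; records in other blocks are never scored.
    blocks = defaultdict(list)
    for rec in existing_records:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for new in new_records:
        for candidate in blocks.get(blocking_key(new), []):
            score = similarity(new, candidate)
            if score >= threshold:
                matches.append((new, candidate, score))
    return matches
```

The same grouping step is what I would expect to map onto a MapReduce job
(key by block in the map phase, compare within each block in the reduce
phase), and running the existing records against themselves would cover the
de-duplication case. A blocking key this strict will miss pairs whose postal
code or last-name initial differ, so it is only a starting point.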

I saw the
Booz Allen Hamilton presentation on fuzzy matching, but I don't see any
distribution of that implementation. In any case, I don't need real
time right now; I need batch.

Mahout might be the way to go, but I suspect I would be
re-inventing at least a wheel or two.

Any comments appreciated. 

Best,
James
