Hey,
First time writing in. I am in the midst of setting up a small cluster for testing (using Cloudera's CDH 3 on Fedora 14), and I am trying to work out a good model for a use case which just came up.

I have around 6 million active records in a contacts database, plus millions more historical address records tied to them. I have just received 60,000+ new records, not yet linked to any of these, that I need to fuzzy match against both the active and the historical records. It starts there, but I will also need to run the same matching on the database against itself for de-duplication. The data is primarily in Oracle (with the supplement in CSV files).

I saw the Booz Allen Hamilton presentation on fuzzy matching, but I don't see any distribution of that implementation. Also, I don't need real time right now; batch is what I need. Mahout might be the way to go, but I suspect I'd be re-inventing at least a wheel or two.

Any comments appreciated.

Best,
James
