Basically, I have two databases containing lists of postal addresses and need to look for matching addresses in the two databases. More precisely, for each address in database A I want to find a single matching address in database B.
I'm 90% of the way there, in the sense that I have a simplistic approach that matches 90% of the addresses in database A. But the remaining 10% could be a pain to deal with!
It's probably not relevant, but I'm using ZODB to store the databases.
The current approach is to loop over the addresses in database A. For each one I identify all the addresses in database B that share the same postal code (typically fewer than 50); the database has a mapping that lets me do this efficiently. Then I look for 'good' matches among those candidates. If there is exactly one, I declare a success. This isn't as efficient as it could be: it's O(n^2) for each postcode, because I end up comparing all possible pairs. But it's fast enough for my application.
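To make the loop concrete, here is a minimal sketch of it in Python. The record layout (plain (postcode, address) tuples), the function names and the placeholder is_good_match test are all assumptions made for illustration; the real code works against ZODB objects and uses a more involved comparison.

```python
from collections import defaultdict


def is_good_match(addr_a, addr_b):
    # Placeholder test (an assumption): equality ignoring case and
    # surrounding whitespace. The real 'good match' test is more involved.
    return addr_a.strip().lower() == addr_b.strip().lower()


def index_by_postcode(records):
    # Build the postcode -> [addresses] mapping used to find candidates.
    index = defaultdict(list)
    for postcode, address in records:
        index[postcode].append(address)
    return index


def match_addresses(db_a, db_b):
    # For each address in A, compare against every address in B that
    # shares its postcode, and accept only an unambiguous single match.
    b_by_postcode = index_by_postcode(db_b)
    matched, unmatched = {}, []
    for postcode, addr_a in db_a:
        candidates = b_by_postcode.get(postcode, [])
        good = [b for b in candidates if is_good_match(addr_a, b)]
        if len(good) == 1:
            matched[addr_a] = good[0]
        else:
            # Zero matches or several: no success declared.
            unmatched.append(addr_a)
    return matched, unmatched
```

Calling match_addresses(db_a, db_b) with two lists of (postcode, address) tuples returns the addresses from A that found exactly one partner, plus the list of those that didn't.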
The problem is looking for good matches. I currently normalise the addresses to ignore some irrelevant issues like case and punctuation, but there are other issues.
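The kind of normalisation I mean looks roughly like the sketch below. The function name and the exact rules (upper-casing, dropping apostrophes, turning any other punctuation into spaces) are assumptions chosen for illustration rather than the precise rules in my code, and the character class assumes plain ASCII addresses.

```python
import re


def normalise(address):
    """Upper-case, strip punctuation and collapse runs of whitespace."""
    text = address.upper()
    # Drop apostrophes outright so "John's" becomes "JOHNS", not "JOHN S".
    text = text.replace("'", "")
    # Treat every other non-alphanumeric character as a word separator.
    text = re.sub(r"[^A-Z0-9]+", " ", text)
    return " ".join(text.split())
```

This is enough to make "DORSET, DT8 3SF" and "dorset DT8 3SF" compare equal, but it does nothing about a leading "The", missing words, or words that are split differently.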
Here are just some examples where the software didn't declare a match:
1 Brantwood, BEAMINSTER, DORSET, DT8 3SS
THE BEECHES 1, BRANTWOOD, BEAMINSTER, DORSET DT8 3SS

Flat 2, Bethany House, Broadwindsor Road, BEAMINSTER, DORSET, DT8 3PP
2, BETHANY HOUSE, BEAMINSTER, DORSET DT8 3PP

Penthouse,Old Vicarage, 1 Clay Lane, BEAMINSTER, DORSET, DT8 3BU
PENTHOUSE FLAT THE OLD VICARAGE 1, CLAY LANE, BEAMINSTER, DORSET DT8 3BU

St John's Presbytery, Shortmoor, BEAMINSTER, DORSET, DT8 3EL
THE PRESBYTERY, SHORTMOOR, BEAMINSTER, DORSET DT8 3EL

The Pinnacles, White Sheet Hill, BEAMINSTER, DORSET, DT8 3SF
PINNACLES, WHITESHEET HILL, BEAMINSTER, DORSET DT8 3SF
The challenge is to fix some of the false negatives above without introducing false positives!
Any pointers gratefully received.
-- Andrew McLean