Find duplicated contacts (and typos)
Hi!

I have a contact application where I need to display possible duplicates within the existing contacts. Possible duplicates means different contact entries that refer to the same person and might have the same or slightly different information (typos).

What I currently do is search for different levels of duplication (it's a single union of 3 queries):
- the first query searches for exact duplicates (exactly the same name, address, email, phone, etc);
- second query searches for matches using the soundex algorithm on a restricted set of fields and is given a lower matching score;
- third query applies soundex on more fields and is given an even lower matching score.

Is there a better algorithm or way to do this fuzzy duplication search over multiple fields (firstname, lastname, address, etc)? Pointers to wikipedia, books, etc appreciated.

--
Mack

~| Order the Adobe Coldfusion Anthology now! http://www.amazon.com/Adobe-Coldfusion-Anthology/dp/1430272155/?tag=houseoffusion
Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338354
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/groups/cf-talk/unsubscribe.cfm
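For readers following the thread, the three-level scheme described above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not Mack's actual queries: the field names, the choice of which fields are compared by soundex at each level, and the scores 3/2/1 are all hypothetical. The soundex function is a plain implementation of standard American Soundex; database implementations can differ in details.

```python
# Standard American Soundex letter groups.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(text):
    """Return the 4-character Soundex code of text ('' if it has no letters)."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


# Hypothetical field layout; the thread does not say which fields each query uses.
ALL_FIELDS = ("firstname", "lastname", "address", "email", "phone")
NARROW_FUZZY = ("firstname", "lastname")           # assumed "restricted set"
WIDE_FUZZY = ("firstname", "lastname", "address")  # assumed "more fields"


def same_under(a, b, fuzzy_fields):
    """Compare fuzzy_fields by soundex and every other field exactly."""
    for field in ALL_FIELDS:
        left, right = a.get(field, ""), b.get(field, "")
        if field in fuzzy_fields:
            if soundex(left) != soundex(right):
                return False
        elif left != right:
            return False
    return True


def match_score(a, b):
    """3 = exact duplicate, 2 = narrow soundex match, 1 = wide, 0 = no match."""
    if same_under(a, b, ()):
        return 3
    if same_under(a, b, NARROW_FUZZY):
        return 2
    if same_under(a, b, WIDE_FUZZY):
        return 1
    return 0
```

Soundex applied to a whole address string is crude, since only the first few consonants contribute to the code, so the widest level is best treated as a list of candidates for manual review.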
Re: Find duplicated contacts (and typos)
Sadly I don't think soundex is going to help you, as it finds words that sound like other words (there, their, they're), which isn't going to pick up typos.

I think comparing each contact field on an OR basis is a sufficient way to find dupes: if none of those fields are the same, then it is not really a duplicate. Even if you have a typo in one field, the other fields are surely going to have a match.

Russ

On Wed, Oct 20, 2010 at 12:53 PM, Mack mrsmith.w...@gmail.com wrote:
> Is there a better algorithm or way to do this fuzzy duplication search over multiple fields (firstname, lastname, address, etc)?

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338356
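A minimal Python sketch of the OR-basis comparison suggested above, for illustration only. The field names are hypothetical, and the trimming and lowercasing step is an added assumption rather than something stated in the thread.

```python
# Hypothetical field names; the thread only mentions name, address, email, phone.
FIELDS = ("firstname", "lastname", "address", "email", "phone")


def norm(value):
    """Trim and lowercase so that trivial differences do not block a match."""
    return (value or "").strip().lower()


def shares_any_field(a, b, fields=FIELDS):
    """True if at least one non-empty field is identical in both contacts."""
    return any(norm(a.get(f)) and norm(a.get(f)) == norm(b.get(f))
               for f in fields)
```

Two trade-offs are worth keeping in mind with this approach. A match on a common value alone (for example the same first name) flags many unrelated contacts, and two entries for the same person that differ slightly in every field are not flagged at all.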
Re: Find duplicated contacts (and typos)
On Wed, Oct 20, 2010 at 4:22 PM, Russ Michaels r...@michaels.me.uk wrote:
> Sadly I don't think soundex is going to help you, as it finds words that sound like other words (there, their, they're), which isn't going to pick up typos.

In my testing soundex works reasonably well if I allow the two values to be within a distance of each other (I'm not searching for an exact soundex match, but for the two codes to be close to each other). For example,

select soundex('Morris'), soundex('Moris');

returns the same value, M620, for both names, and that is OK for me.

> I think comparing each contact field on an OR basis is a sufficient way to find dupes: if none of those fields are the same, then it is not really a duplicate. Even if you have a typo in one field, the other fields are surely going to have a match.

Unfortunately I get very few duplicates that way because of small differences, especially in the address (East Drive vs East Dr.).

--
Mack

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338357
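The address problem mentioned above (East Drive vs East Dr.) is the kind of difference that a normalization pass can remove before any comparison runs. The Python sketch below is a rough illustration and not something proposed in the thread: the abbreviation table is a tiny hypothetical sample, and the similarity function uses difflib.SequenceMatcher from the standard library as a stand-in for "values within a distance of each other".

```python
import re
from difflib import SequenceMatcher

# Hypothetical sample table; a real one would be much longer and locale-specific
# (note that "st" can also mean "saint", so blind expansion can be wrong).
ABBREVIATIONS = {"dr": "drive", "st": "street", "ave": "avenue", "rd": "road"}


def normalize_address(address):
    """Lowercase, drop punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def probably_same_address(a, b, threshold=0.9):
    """Compare normalized addresses; the 0.9 threshold is an arbitrary assumption."""
    return similarity(normalize_address(a), normalize_address(b)) >= threshold
```

With this sketch, "East Drive" and "East Dr." normalize to the same string, and a name pair such as Morris and Moris gets a high similarity ratio without relying on soundex. Any threshold would need tuning against real data.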
RE: Find duplicated contacts (and typos)
Crikey, you must just have a really shagged contacts list LOL.

Have you tried Plaxo? It does a very good job of removing duplicates. Or does it need to be an in-house app?

Russ

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338365
Re: Find duplicated contacts (and typos)
On Wed, Oct 20, 2010 at 7:35 PM, Russ Michaels r...@michaels.me.uk wrote:
> Crikey, you must just have a really shagged contacts list LOL. Have you tried Plaxo? It does a very good job of removing duplicates. Or does it need to be an in-house app?

Russ,

This is part of a larger app, and the contacts come from different sources (a couple of websites that currently have their own customer databases and need to be integrated into the intranet app).

--
Mack

Archive: http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338389