Find duplicated contacts (and typos)

2010-10-20 Thread Mack

Hi!

I have a contact application where I need to display possible
duplicates within the existing contacts. Possible duplicates means
different contact entries that refer to the same person and might have
the same or slightly different information (typos).

What I currently do is search for different levels of duplication
(it's a single union of 3 queries):
- the first query searches for exact duplicates (exactly the same
name, address, email, phone, etc);
- the second query searches for matches using the soundex algorithm on a
restricted set of fields and is given a lower matching score;
- the third query applies soundex to more fields and is given an even
lower matching score.
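
The three tiers above can be sketched outside SQL as follows. This is only an illustration under stated assumptions: the field names, the scores 3/2/1, and the rule that fields not compared by soundex must match exactly are all hypothetical, since the actual queries are not shown. The soundex function is a plain American Soundex written for this sketch; a database's SOUNDEX() may differ in detail.

```python
# American Soundex digit table.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(text):
    """Return the 4-character Soundex code of text ("" if it has no letters)."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not separate two equal codes
            prev = code
    return (out + "000")[:4]


# Hypothetical field list and tiers: (score, fields compared by soundex).
FIELDS = ("firstname", "lastname", "address", "email", "phone")
TIERS = (
    (3, ()),                                    # exact duplicate
    (2, ("firstname", "lastname")),             # soundex on a restricted set
    (1, ("firstname", "lastname", "address")),  # soundex on more fields
)


def fields_match(a, b, fuzzy_fields):
    """True if every field matches: by soundex if fuzzy, else exactly."""
    for field in FIELDS:
        x, y = a.get(field, ""), b.get(field, "")
        if field in fuzzy_fields:
            if soundex(x) != soundex(y):
                return False
        elif x != y:
            return False
    return True


def match_score(a, b):
    """Return the score of the strictest tier that matches, or 0 for none."""
    for score, fuzzy_fields in TIERS:
        if fields_match(a, b, fuzzy_fields):
            return score
    return 0
```

Checking the strictest tier first gives each pair the highest score it qualifies for, which is what taking the best row from a union of the three queries would do.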

Is there a better algorithm or approach for this fuzzy duplicate search
over multiple fields (firstname, lastname, address, etc.)? Pointers to
Wikipedia, books, etc. are appreciated.

-- 
Mack

~|
Order the Adobe Coldfusion Anthology now!
http://www.amazon.com/Adobe-Coldfusion-Anthology/dp/1430272155/?tag=houseoffusion
Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338354
Subscription: http://www.houseoffusion.com/groups/cf-talk/subscribe.cfm
Unsubscribe: http://www.houseoffusion.com/groups/cf-talk/unsubscribe.cfm


Re: Find duplicated contacts (and typos)

2010-10-20 Thread Russ Michaels

Sadly I don't think soundex is going to help you, as it finds words that
sound like other words (there, their, they're), which isn't going to pick up
typos.
I think comparing each contact field on an OR basis is a sufficient way to
find dupes: if none of those fields are the same, then it is not really a
duplicate. Even if one field has a typo, the other fields are surely going
to match.
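
A minimal sketch of this OR-style comparison, for illustration only. The field names and the function name are hypothetical, and empty fields are ignored here so that two blank values do not count as a match.

```python
from itertools import combinations

# Hypothetical field names; adjust to the real contact schema.
FIELDS = ("firstname", "lastname", "address", "email", "phone")


def shares_any_field(a, b, fields=FIELDS):
    """True if at least one non-empty field is identical in both contacts."""
    return any(a.get(f) and a.get(f) == b.get(f) for f in fields)


def candidate_pairs(contacts):
    """Return index pairs (i, j) of contacts that share at least one field."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(contacts), 2)
            if shares_any_field(a, b)]
```

Comparing every pair is quadratic in the number of contacts, so this is only practical as written for small lists.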

Russ



Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338356


Re: Find duplicated contacts (and typos)

2010-10-20 Thread Mack

On Wed, Oct 20, 2010 at 4:22 PM, Russ Michaels r...@michaels.me.uk wrote:

> Sadly I don't think soundex is going to help you as this finds words that
> sound like other words (there, their, they're), which isn't going to pick up
> typos.

In my testing soundex works reasonably well if I allow the two values to
be within a distance of each other (I'm not searching for an exact
soundex match but for the two codes to be close to each other).

For example, select soundex('Morris'), soundex('Moris'); returns the
same value, M620, for both, and that is OK for me.
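
The message does not say how the distance between two soundex values is measured. One possible reading, sketched here purely as an assumption, is to count the positions at which the two 4-character codes agree (4 meaning identical codes). The soundex function below is a plain American Soundex written for this sketch, and code_similarity is a hypothetical helper name.

```python
# American Soundex digit table.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(text):
    """Return the 4-character Soundex code of text ("" if it has no letters)."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not separate two equal codes
            prev = code
    return (out + "000")[:4]


def code_similarity(a, b):
    """Count positions (0 to 4) at which the two Soundex codes agree."""
    return sum(x == y for x, y in zip(soundex(a), soundex(b)))


print(soundex("Morris"), soundex("Moris"))    # M620 M620
print(code_similarity("Morris", "Morrison"))  # 3 (M620 vs M625)
```

A threshold such as "at least 3 of 4 positions agree" would then treat Morris and Morrison as close while still rejecting unrelated names.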

> I think comparing each contact field on an OR basis is a sufficient way to
> find dupes, if none of those fields are the same then it is not really a
> duplicate. Even if you have a typo in the same, the other fields are going
> to have a match surely.

Unfortunately I get very few duplicates that way because of small
differences, especially in the address (East Drive vs East Dr.).
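
Differences such as East Drive vs East Dr. can often be removed by normalizing addresses before comparing them. The following is a rough sketch of that idea, not something described in this thread: the abbreviation table is hypothetical and deliberately tiny, and normalize_address is an illustrative name.

```python
import re

# Hypothetical, incomplete abbreviation table. Real data needs a longer list
# and care with ambiguous tokens (for example "st" can mean street or saint).
ABBREVIATIONS = {"dr": "drive", "st": "street", "ave": "avenue", "rd": "road"}


def normalize_address(address):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


print(normalize_address("East Drive"))  # east drive
print(normalize_address("East Dr."))    # east drive
```

Comparing the normalized strings lets an exact or OR-style match succeed on the address without any phonetic matching.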

-- 
Mack

Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338357


RE: Find duplicated contacts (and typos)

2010-10-20 Thread Russ Michaels

Crikey, you must just have a really shagged contacts list LOL.
Have you tried Plaxo? It does a very good job of removing duplicates.
Or does it need to be an in-house app?

Russ


Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338365


Re: Find duplicated contacts (and typos)

2010-10-20 Thread Mack

On Wed, Oct 20, 2010 at 7:35 PM, Russ Michaels r...@michaels.me.uk wrote:

> Crikey, you must just have a really shagged contacts list LOL.
> Have you tried Plaxo, this does a very good job of removing duplicates.
> Or does it need to be an in-house app ?

Russ,

This is part of a larger app and the contacts are coming from
different sources (a couple of websites that currently have their own
databases with customers and need to be integrated into the intranet
app).

-- 
Mack

Archive: 
http://www.houseoffusion.com/groups/cf-talk/message.cfm/messageid:338389