New reply on DataCleaner's online discussion forum 
(http://datacleaner.org/forum):

Kasper Sørensen replied to subject 'Record linkage'

-------------------

Hi Jesper,

First of all - talking about linkage, I'm a bit shocked because your name 
(Jesper Lind) is also the name of one of the very earliest contributors to the 
DataCleaner codebase, a guy that I used to study with. So that's a pretty weird 
coincidence, but a "false positive" in terms of matching :-D

Anyway, back on track ... There are various ways in which record linkage may be 
achieved. The main one is our "Duplicate detection" component. This 
component is part of the commercial editions of DataCleaner - see 
[http://datacleaner.org/focus/dedup the page/video about deduplication] for a 5 
minute introduction.

In addition we have a lot of standardization and parsing functionality which 
may be useful in preparing for matching. For instance I am right now working on 
a case where product details need to be matched, but they are described in 
very different order and style. So we employ various functions such as "Regex 
parser", "Synonym lookup", "Remove dictionary matches" etc. to tidy up the 
product names/descriptions first, so that they can be better matched.
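
To illustrate the idea of tidying up before matching, here is a minimal 
sketch in plain Python. It does not use DataCleaner's components or API; the 
synonym table, the noise-word list and the standardize function are all 
hypothetical names made up for this example. It tokenizes with a regex, maps 
synonyms, drops dictionary matches and sorts the tokens so that word order no 
longer matters.

```python
import re

# Hypothetical synonym table and noise-word dictionary, for illustration only.
SYNONYMS = {"tv": "television", "blk": "black", "wht": "white"}
NOISE_WORDS = {"new", "original", "genuine"}


def standardize(description: str) -> str:
    """Return a normalized form of a free-text product description."""
    # Regex parsing: keep only lower-cased alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    # Synonym lookup: replace known abbreviations with a canonical term.
    tokens = [SYNONYMS.get(token, token) for token in tokens]
    # Remove dictionary matches: drop words that carry no matching value.
    tokens = [token for token in tokens if token not in NOISE_WORDS]
    # Sort so that differently ordered descriptions compare equal.
    return " ".join(sorted(tokens))
```

With these assumed tables, "NEW Sony TV 42in, blk" and 
"Black 42in television - Sony (original)" standardize to the same string, 
which is what makes the later matching step easier.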

Once the data is standardized and "tidy" you might want to use the 
Duplicate detection function. As alternatives, you could consider 
indexing the records in ElasticSearch and then searching that index, 
which gives you a bit of fuzziness, or you might use a simple Table lookup.
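
As a rough picture of how an exact lookup and a fuzzy fallback can work 
together, here is a small Python sketch. It is not ElasticSearch and not 
DataCleaner's Table lookup: the fuzziness comes from difflib in the Python 
standard library as a stand-in, and the function names, the sample table and 
the 0.8 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def best_match(query, candidates, threshold=0.8):
    """Return (candidate, score) for the most similar candidate, or None."""
    best = None
    for candidate in candidates:
        score = SequenceMatcher(None, query, candidate).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best


def link(record, table, threshold=0.8):
    """Look up a standardized record, falling back to a fuzzy comparison."""
    if record in table:  # exact table lookup
        return table[record]
    match = best_match(record, table.keys(), threshold)
    return table[match[0]] if match else None


# Hypothetical reference table: standardized description -> product id.
products = {"42in black sony television": "P-001", "2m cable usb": "P-002"}
```

Here link("42in black sony televsion", products) still resolves to "P-001" 
despite the typo, while an unrelated description such as "garden hose" 
returns None. Comparing every record against every candidate like this only 
scales to small tables, which is one reason an index is attractive.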

All of DataCleaner can be driven from our public Java API. The job file format 
(.analysis.xml) is also quite straightforward and friendly for editing with 
tools or by hand if you want to do that. And if you have a server edition of 
DataCleaner you also get a web/RESTful API for running jobs, invoking jobs with 
parameterized datasets etc.

-------------------

View the topic online to reply - go to 
http://datacleaner.org/topic/1092/Record-linkage
