New reply on DataCleaner's online discussion forum (http://datacleaner.org/forum):
Kasper Sørensen replied to subject 'Record linkage'
-------------------
Hi Jesper,

First of all, speaking of linkage, I'm a bit shocked, because your name (Jesper Lind) is also the name of one of the very earliest contributors to the DataCleaner codebase, a guy I used to study with. So that's a pretty weird coincidence, but a "false positive" in terms of matching :-D

Anyway, back on track ... There are several ways to achieve record linkage. The main one is our "Duplicate detection" component. This component is part of the commercial editions of DataCleaner - see [http://datacleaner.org/focus/dedup the page/video about deduplication] for a 5 minute introduction.

In addition, we have a lot of standardization and parsing functionality which may be useful in preparing for matching. For instance, I am right now working on a case where product details need to be matched, but they are described in very different order and style. So we employ various functions such as "Regex parser", "Synonym lookup", "Remove dictionary matches" etc. to tidy up the product names/descriptions first, so that they can be better matched. Once the data is standardized and "tidy", you might want to use the Duplicate detection function.

Alternatives to this include indexing the records in ElasticSearch and then searching on that, which gives you a bit of fuzziness. You might also use a simple Table lookup.

All of DataCleaner can be driven from our public Java API. The job file format (.analysis.xml) is also quite straightforward and friendly for editing via tools or by hand, if you want to do that. And if you have a server edition of DataCleaner, you also get a web/RESTful API for running jobs, invoking jobs with parameterized datasets etc.
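
To make the "tidy up first, then match" idea above a bit more concrete, here is a minimal sketch in plain Python. It does not use DataCleaner or its API; it only imitates the general idea of a regex-based tokenizer, a synonym lookup and a dictionary-based removal step before a fuzzy comparison. The synonym table, the noise-word list and the function names are all made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables, invented for this example only.
SYNONYMS = {"tv": "television", "blk": "black"}
NOISE_WORDS = {"new", "original", "the"}


def standardize(description: str) -> str:
    """Return a normalized form of a product description.

    Steps: lowercase, split into alphanumeric tokens with a regex,
    replace known synonyms, drop noise words, and sort the tokens so
    that word order no longer matters.
    """
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    tokens = [SYNONYMS.get(token, token) for token in tokens]
    tokens = [token for token in tokens if token not in NOISE_WORDS]
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two standardized descriptions."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


if __name__ == "__main__":
    left = "NEW Samsung TV 40 inch, blk"
    right = "Samsung 40 inch Television (Black)"
    print(standardize(left))
    print(standardize(right))
    print(similarity(left, right))
```

Both descriptions in the usage example standardize to the same token string even though they differ in order, abbreviations and punctuation, which is the point of standardizing before matching. A real matching setup would of course need much richer dictionaries and a more careful comparison than this.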
-------------------
View the topic online to reply - go to http://datacleaner.org/topic/1092/Record-linkage
