New reply on DataCleaner's online discussion forum 
(http://datacleaner.org/forum):

Kasper Sørensen replied to subject 'Is there a general need to generate deltas 
between two files?'

-------------------

It's funny to see such an old topic come back to life ;-) Since 2011 I have 
been doing similar things in many different ways and my conclusion nowadays 
seems to be "we have the functions, but how you want to use them really 
_depends_".

To walk through some options we have:

 * Using "table lookup" to match two datasets based on IDs (exact matching, not 
always possible).
 * Using the ElasticSearch plugin to index one dataset and then searching it 
for the best matching complete record (takes many fields into account, but 
results may be quite fuzzy and sometimes not what you want).
  * Addition: Using the "EasyDQ additionals" extension to get a Similarity 
score between the original and returned fields. That way you can introduce a 
similarity threshold.
 * Using the Duplicate Detection function that is available in commercial 
editions of DataCleaner.
  * First combine the two (or more) datasets into a single dataset by inserting 
them all into a single store. In the single store make sure to add a field that 
identifies the filename/id of the source set.
  * Run the training mode, but in the advanced parameters set it up to not 
regard any pair as a duplicate when the filename/id field is the same. This 
keeps it from reporting duplicates that occur within just a single set.
  * Proceed otherwise as normal when using the duplicate detection function.
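
To make the options above a bit more concrete, here is a rough sketch in plain 
Python. It is not DataCleaner's API: the function names, the "id", "name" and 
"source" fields and the 0.85 threshold are all made up for illustration, and 
difflib's ratio is only a stand-in for a similarity score (it is not what 
EasyDQ or the Duplicate Detection function computes). It shows exact matching 
on IDs, and fuzzy matching with a threshold that skips pairs coming from the 
same source set.

```python
from difflib import SequenceMatcher
from itertools import combinations


def id_delta(old_rows, new_rows, key="id"):
    """Exact matching on an ID field (similar in spirit to a table lookup)."""
    old_by_id = {row[key]: row for row in old_rows}
    new_by_id = {row[key]: row for row in new_rows}
    added = [new_by_id[k] for k in new_by_id if k not in old_by_id]
    removed = [old_by_id[k] for k in old_by_id if k not in new_by_id]
    changed = [
        (old_by_id[k], new_by_id[k])
        for k in old_by_id
        if k in new_by_id and old_by_id[k] != new_by_id[k]
    ]
    return added, removed, changed


def similarity(a, b, fields):
    """Average per-field string similarity in [0, 1] (assumed measure)."""
    scores = [
        SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def cross_source_matches(rows, fields, threshold=0.85, source_field="source"):
    """Pairs scoring at or above the threshold, ignoring same-source pairs."""
    matches = []
    for a, b in combinations(rows, 2):
        if a[source_field] == b[source_field]:
            continue  # never treat records from the same set as duplicates
        score = similarity(a, b, fields)
        if score >= threshold:
            matches.append((a, b, score))
    return matches


# Hypothetical sample data, only to show how the functions are called.
old = [{"id": 1, "name": "Ann Smith"}, {"id": 2, "name": "Bob Jones"}]
new = [{"id": 1, "name": "Ann Smyth"}, {"id": 3, "name": "Cara Lee"}]

added, removed, changed = id_delta(old, new)

# Combine both sets into one list and tag each record with its source set.
combined = [dict(r, source="old.csv") for r in old] + [
    dict(r, source="new.csv") for r in new
]
matches = cross_source_matches(combined, fields=["name"])
```

The pairwise loop is quadratic, so this only works for small sets. For larger 
volumes you would want an index or blocking step first, which is roughly the 
role the search index plays in the second option.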

-------------------

View the topic online to reply - go to 
http://datacleaner.org/topic/255/Is-there-a-general-need-to-generate-deltas-between-two-files%3F
