New reply on DataCleaner's online discussion forum (http://datacleaner.org/forum):
Kasper Sørensen replied to subject 'Is there a general need to generate deltas between two files?'
-------------------
It's funny to see such an old topic come back to life ;-)

Since 2011 I have been doing similar things in many different ways, and my conclusion nowadays seems to be "we have the functions, but how you want to use them really _depends_". To walk through some of the options we have:

* Using "table lookup" to match two datasets based on IDs (exact matching, which is not always possible).
* Using the ElasticSearch plugin to index one dataset and then searching it for the best-matching complete record (this takes many fields into account, but results may be quite fuzzy and sometimes not what you want).
  * Addition: Using the "EasyDQ additionals" extension to get a similarity score between the original and returned fields. That way you can introduce a similarity threshold.
* Using the Duplicate Detection function that is available in commercial editions of DataCleaner.
  * First combine the two (or more) datasets into a single dataset by inserting them all into a single store. In the single store, make sure to add a field that identifies the filename/id of the source set.
  * Run the training mode, but in the advanced parameters set it up not to regard anything as a duplicate when the filename/id field is the same. This will prevent you from finding duplicates that occur within just a single set.
  * Otherwise proceed as normal when using the duplicate detection function.
-------------------
View the topic online to reply - go to http://datacleaner.org/topic/255/Is-there-a-general-need-to-generate-deltas-between-two-files%3F
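To make two of the options above a bit more concrete, here is a minimal sketch in plain Python. It is not DataCleaner's API and not how any of the components mentioned are implemented; it only illustrates the ideas of an exact ID lookup and of a similarity threshold that ignores pairs from the same source. The function names, the `id`/`source` field names, the 0.85 threshold and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def delta_by_id(old_rows, new_rows, key="id"):
    """Exact matching on an ID: return (added, removed, changed) key lists."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed


def cross_source_matches(rows, field, threshold=0.85):
    """Fuzzy matching on one field over a combined dataset.

    Each row carries a hypothetical "source" field naming the file it came
    from. Pairs from the same source are skipped, so only matches across
    files are reported.
    """
    pairs = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            if a["source"] == b["source"]:
                continue  # same file: not a candidate pair
            score = SequenceMatcher(None, a[field], b[field]).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"], score))
    return pairs


if __name__ == "__main__":
    old_rows = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
    new_rows = [{"id": 2, "name": "Bobby"}, {"id": 3, "name": "Cy"}]
    print(delta_by_id(old_rows, new_rows))

    combined = [
        {"id": "a1", "source": "a.csv", "name": "Kasper Sorensen"},
        {"id": "a2", "source": "a.csv", "name": "Kasper Sorensen"},
        {"id": "b1", "source": "b.csv", "name": "Kasper Sørensen"},
    ]
    print(cross_source_matches(combined, "name"))
```

The pairwise loop is quadratic, so this is only reasonable for small files; for anything larger, an index (as in the ElasticSearch option) or a dedicated duplicate detection function is the more realistic route.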
