Hi Sarah,

Eike is on the right track -- the problem you're facing isn't really a sorting problem. What you ultimately want to figure out is how strongly any given item matches every other item, measured against your threshold (80%). Ideally, you'd end up with the original list and, beside each item, a list of the items similar to it (those scoring above your threshold). The list itself wouldn't need to be sorted (except perhaps alphabetically by title or something similar), and it could perhaps be filtered on whether an item has any matches at all.
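Here's a rough Python sketch of the kind of output I mean. Treat it as an illustration only: the field names, the weights and the 30-second runtime tolerance are all invented, and it uses the similarity ratio from Python's standard difflib module as a stand-in for a proper edit-distance (Levenshtein) score.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical field names and weights; adjust them to the real export.
WEIGHTS = {"description": 0.6, "title": 0.2, "runtime": 0.2}
THRESHOLD = 0.80


def text_similarity(a, b):
    """Return a 0..1 similarity for two strings, ignoring case and extra spaces."""
    a = " ".join(a.lower().split())
    b = " ".join(b.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def runtime_similarity(a, b, tolerance=30):
    """Return 1.0 for equal runtimes (in seconds), falling to 0.0 at `tolerance`."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)


def record_similarity(r1, r2):
    """Weighted average of per-field scores, skipping fields blank in either record."""
    total = weight_sum = 0.0
    for field, weight in WEIGHTS.items():
        v1, v2 = r1.get(field), r2.get(field)
        if v1 in (None, "") or v2 in (None, ""):
            continue  # e.g. the rarely used title field
        if field == "runtime":
            score = runtime_similarity(v1, v2)
        else:
            score = text_similarity(v1, v2)
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def find_matches(records, threshold=THRESHOLD):
    """Map each record index to a list of (other index, score) at or above threshold."""
    matches = {i: [] for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        score = record_similarity(records[i], records[j])
        if score >= threshold:
            matches[i].append((j, score))
            matches[j].append((i, score))
    return matches


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the result.
    sample = [
        {"title": "", "description": "Interview with J. Smith, tape 1 of 2", "runtime": 1800},
        {"title": "", "description": "Interview w/ J Smith - tape 1 of 2", "runtime": 1803},
        {"title": "", "description": "Board meeting, March 1998", "runtime": 3600},
    ]
    for index, similar in find_matches(sample).items():
        print(index, sample[index]["description"], "->", similar)
```

One caveat: comparing every item against every other item means roughly 1.25 billion pairs for 50,000 items, so in practice you would probably want to narrow the candidates first (for example, only comparing items whose runtimes are within a minute of each other).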
In principle, it's a relatively simple programming challenge -- you'd compare the value of each field of an item to the value of each field in every other item to generate a similarity score (using an edit-distance measure such as Levenshtein) and then combine those into an aggregate similarity score. The tricky part is tuning the threshold to eliminate false positives while minimizing false negatives, and the right setting depends on your data.

Tim

---------------------------------------------
Tim Au Yeung
Manager, Digital Object Repository Technology
Libraries and Cultural Resources
University of Calgary
(403) 220-8975
ytau(at)ucalgary(dot)ca

Eike Friedrich wrote:
> It sounds like what you want to do is probabilistic searching rather than
> boolean 'yes/no' matches. In my experience these are not really built into
> Excel or such like. I had started a project trying to implement this via VBA
> and Macros in MS Access/Excel, but I think although possible this would
> involve much development.
>
> I have come across (but not used) an open-source system called Xapian
> (http://www.xapian.org/features.php) which will allow you to index your
> Excel documents/database in a sophisticated manner, but with lower
> development requirements. The developers of Xapian are somehow connected to
> a company called Lemur Consulting who produce a product called Bamboo which
> is designed to mine data in academic collections. There is a free evaluation
> of Bamboo Personal Edition - contact Lemur Consulting. You may be able to
> use this or something like it to return percentage type results. An
> interesting problem, I hope this offers some help.
>
> on 11 May 2007 20:37 sarah johnson wrote:
>
>> I recently became involved with a project attempting to sort through
>> a rather large (50,000 or so) collection of multimedia assets
>> (videotapes, audiotapes, cds, dvds), looking for duplicate copies. I
>> would like to at least get a start on this programmatically, using
>> the database they were recorded in (the database has been exported to
>> an excel file).
>>
>> However, I am having some trouble coming up with a method to query
>> and sort the data.
>>
>> Each piece was logged separately, with no mention of whether it was
>> actually a clone or viewing copy of a previously entered piece.
>>
>> For the most part, all of the information on the label was typed into
>> a single 'description' field. Runtimes and dates are both in
>> separate fields, as is title (although the separate title field was
>> somewhat rarely used).
>> Unfortunately, the labels did not always read exactly alike even if
>> the videos were, nor did the layout of the entries always match.
>> Runtimes are also often off by a few seconds.
>>
>> I really need a method of automatically pulling up all assets with
>> about an 80% match between fields. Is there a way to do this? Or am
>> I approaching this problem all wrong? Any help at all would be
>> greatly, greatly appreciated!
>>
>> Sarah Johnson
>> sarah at dvs.com
