Hi Sarah,

Eike is on the right track -- the problem you're facing isn't a sorting 
problem because ultimately, what you'd want to figure out is the 
strength of the match of any given item to every other item, measured 
against your threshold (80%). Ideally, what you'd end up with is the 
original list with, beside each item, a list of the items similar to it 
(above your threshold). The list itself wouldn't be sorted (except 
perhaps alphabetically by title or something similar), and perhaps the 
list might be filtered to exclude items that have no matches.

In principle, it's a relatively simple programming challenge -- you'd 
compare the value of each field of an item to the value of each field in 
every other item to generate a similarity score (using an algorithm like 
Levenshtein) and then combine those into an aggregate similarity score. 
The tricky part is tuning the threshold to eliminate false positives 
while minimizing false negatives, which depends on your data.
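
As a rough sketch of the approach above, assuming Python and records 
loaded as dictionaries: the field names (description, title, runtime) 
come from Sarah's description of the data, while the weights, the 
record layout, and the function names are illustrative assumptions 
that would need adjusting to the real export.

```python
from itertools import combinations

# Hypothetical weights per field; tune these against real data.
WEIGHTS = {"description": 0.6, "title": 0.2, "runtime": 0.2}
THRESHOLD = 0.80  # the 80% match Sarah asked for


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Scale edit distance to 0.0 (nothing alike) .. 1.0 (identical)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def record_similarity(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average over the fields that are filled in on both records."""
    total = used = 0.0
    for field, weight in weights.items():
        v1, v2 = r1.get(field, ""), r2.get(field, "")
        if not v1 or not v2:
            continue  # skip blank fields, e.g. the rarely used title
        total += weight * field_similarity(v1, v2)
        used += weight
    return total / used if used else 0.0


def find_matches(records: list, threshold: float = THRESHOLD) -> dict:
    """Map each record index to a list of (other index, score) pairs."""
    matches = {i: [] for i in range(len(records))}
    for i, j in combinations(range(len(records)), 2):
        score = record_similarity(records[i], records[j])
        if score >= threshold:
            matches[i].append((j, score))
            matches[j].append((i, score))
    return matches
```

Comparing every pair grows quadratically: 50,000 items means roughly 
1.25 billion comparisons, so in practice you would probably want to 
narrow the candidates first (for example, only compare items whose 
runtimes are close) before running the full comparison.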

Tim

---------------------------------------------
Tim Au Yeung
Manager, Digital Object Repository Technology
Libraries and Cultural Resources
University of Calgary
(403) 220-8975
ytau(at)ucalgary(dot)ca



Eike Friedrich wrote:
> It sounds like what you want to do is probabilistic searching rather than
> boolean 'yes/no' matches. In my experience these are not really built into
> Excel or such like. I had started a project trying to implement this via VBA
> and Macros in MS Access/Excel, but I think although possible this would
> involve much development.
>
> I have come across (but not used) an open-source system called Xapian
> (http://www.xapian.org/features.php) which will allow you to index your
> Excel documents/database in a sophisticated manner, but with lower
> development requirements. The developers of Xapian are somehow connected to
> a company called Lemur Consulting who produce a product called Bamboo which
> is designed to mine data in academic collections. There is a free evaluation
> of Bamboo Personal Edition - contact Lemur Consulting. You may be able to
> use this or something like it to return percentage type results. An
> interesting problem, I hope this offers some help.  
>
> on 11 May 2007 20:37 sarah johnson wrote:
>
>   
>> I recently became involved with a project attempting to sort through
>> a rather large (50,000 or so) collection of multimedia assets
>> (videotapes, audiotapes, cds, dvds), looking for duplicate copies.  I
>> would like to at least get a start on this programmatically, using
>> the database they were recorded in (the database has been exported to
>> an excel file).     
>>
>> However, I am having some trouble coming up with a method to query
>> and sort the data. 
>>
>> Each piece was logged separately, with no mention of whether it was
>> actually a clone or viewing copy of a previously entered piece. 
>>
>> For the most part, all of the information on the label was typed into
>> a single 'description' field.  Runtimes and dates are both in
>> separate fields, as is title (although the separate title field was
>> somewhat rarely used).   
>> Unfortunately, the labels did not always read exactly alike even if
>> the videos were, nor did the layout of the entries always match. 
>> Runtimes are also often off by a few seconds.  
>>
>> I really need a method of automatically pulling up all assets with
>> about an 80% match between fields.  Is there a way to do this?  Or am
>> I approaching this problem all wrong?  Any help at all would be
>> greatly, greatly appreciated!   
>>
>> Sarah Johnson
>> sarah at dvs.com
>>     
