#183: BibMatch: more permissive fuzzy matching
-------------------------+--------------------------------------------------
 Reporter:  simko        |       Owner:  jlavik
     Type:  enhancement  |      Status:  new   
 Priority:  major        |   Milestone:        
Component:  BibMatch     |     Version:        
 Keywords:               |  
-------------------------+--------------------------------------------------
 BibMatch fuzzy matching should be made more permissive.  For example,
 download the MARCXML file for record 92, change `Topological` into
 `Toological` (leaving out one character to simulate a typo), and try
 to match:

 {{{
 $ wget -O /tmp/z.xml 'http://pcuds33.cern.ch/record/92/export/xm'
 $ sed -i 's/Topological/Toological/' /tmp/z.xml
 $ bibmatch --field=245__a --mode=a < /tmp/z.xml
 }}}

 The record is not fuzzy-matched, while it would be good if it were.

 There are several techniques we can use to help in these cases, e.g.
 `compare_strings()` from BibMerge's differ.  To keep matching
 efficient, however, we may need to pre-store some markers for the
 fields that are most useful for matching, e.g. even something as
 simple as a count of how many times each letter occurs in the field.
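
 To make the idea concrete, here is a minimal sketch of such a
 two-stage match, under stated assumptions: the function names
 (`letter_signature`, `build_index`, `fuzzy_candidates`), the
 thresholds, and the sample titles are all hypothetical, and
 `difflib.SequenceMatcher` from the Python standard library merely
 stands in for BibMerge's `compare_strings()`, whose behaviour is not
 reproduced here.  A pre-stored letter-count marker cheaply rejects
 most records, and the more expensive string comparison only runs on
 the survivors.

```python
from collections import Counter
from difflib import SequenceMatcher


def letter_signature(text):
    """Count how often each letter occurs, ignoring case and non-letters."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def signature_distance(sig_a, sig_b):
    """Sum of absolute differences between the per-letter counts."""
    return sum(abs(sig_a[ch] - sig_b[ch]) for ch in set(sig_a) | set(sig_b))


def build_index(titles):
    """Pre-store a (title, signature) pair per record ID.

    `titles` is a hypothetical dict mapping record ID -> field value,
    e.g. the content of 245__a.
    """
    return {recid: (title, letter_signature(title))
            for recid, title in titles.items()}


def fuzzy_candidates(query, index, max_distance=2, min_ratio=0.9):
    """Return (record ID, similarity) pairs, best match first.

    Stage 1: skip records whose letter counts differ too much.
    Stage 2: run the costlier sequence comparison on the rest.
    Both thresholds are illustrative guesses, not tuned values.
    """
    query_sig = letter_signature(query)
    hits = []
    for recid, (title, sig) in index.items():
        if signature_distance(query_sig, sig) > max_distance:
            continue  # cheap rejection, no string comparison needed
        ratio = SequenceMatcher(None, query.lower(), title.lower()).ratio()
        if ratio >= min_ratio:
            hits.append((recid, ratio))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

 With an index built from a made-up title such as `Topological field
 theory`, the query `Toological field theory` differs by a single
 letter count, so it passes the first stage and is then accepted by the
 second.  A single dropped or added character always changes the
 signature distance by exactly one, which is why a small
 `max_distance` is enough for typo-like cases.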

 P.S. The above CLI sequence can serve as inspiration for
 regression/functional tests for these typo-like cases.

-- 
Ticket URL: <http://invenio-software.org/ticket/183>
Invenio <http://invenio-software.org>
