#183: BibMatch: more permissive fuzzy matching
-------------------------+--------------------------------------------------
Reporter: simko | Owner: jlavik
Type: enhancement | Status: new
Priority: major | Milestone:
Component: BibMatch | Version:
Keywords: |
-------------------------+--------------------------------------------------
BibMatch fuzzy matching should be made more permissive. For example,
download the MARCXML file for record 92, change `Topological` to
`Toological` (dropping one character to simulate a typo), and try to
match:
{{{
$ wget -O /tmp/z.xml 'http://pcuds33.cern.ch/record/92/export/xm'
$ sed -i 's/Topological/Toological/' /tmp/z.xml
$ bibmatch --field=245__a --mode=a < /tmp/z.xml
}}}
The record is not fuzzy-matched, although it should be.
There are several techniques we could use in such cases, e.g.
`compare_strings()` from BibMerge's differ. To keep matching
efficient, however, we may need to pre-store some markers for the
fields that are most useful for matching, even something as simple as
counts of how often each letter occurs in the field value.
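A minimal sketch of the letter-counting idea, to make the discussion
concrete. It is not BibMatch or BibMerge code: the function names, the
thresholds, the record ids and the titles are all made up for
illustration, and `difflib.SequenceMatcher` from the standard library
stands in for `compare_strings()`. The pre-stored letter counts act as
a cheap filter, so the more expensive string comparison only runs on
candidates whose counts are already close:

```python
from collections import Counter
from difflib import SequenceMatcher


def letter_signature(text):
    """Count how often each alphanumeric character occurs, ignoring case."""
    return Counter(ch for ch in text.lower() if ch.isalnum())


def signature_distance(sig_a, sig_b):
    """Number of character occurrences found in one signature but not the other."""
    # Counter subtraction keeps only positive counts, so add both directions.
    diff = (sig_a - sig_b) + (sig_b - sig_a)
    return sum(diff.values())


def fuzzy_candidates(query, stored, max_distance=2, min_ratio=0.9):
    """Return (recid, ratio) pairs for stored titles close to the query.

    `stored` maps a record id to a (title, signature) pair; the signature
    is assumed to have been computed once, at indexing time.
    Both thresholds are arbitrary illustrative defaults.
    """
    query_sig = letter_signature(query)
    matches = []
    for recid, (title, sig) in stored.items():
        # Cheap pre-filter: skip titles whose letter counts differ too much.
        if signature_distance(query_sig, sig) > max_distance:
            continue
        # Expensive step, run only on the surviving candidates.
        ratio = SequenceMatcher(None, query.lower(), title.lower()).ratio()
        if ratio >= min_ratio:
            matches.append((recid, ratio))
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical stored titles; the real title of record 92 is not assumed here.
titles = {92: "Topological field theory", 7: "Quantum chromodynamics"}
stored = {recid: (t, letter_signature(t)) for recid, t in titles.items()}

print(fuzzy_candidates("Toological field theory", stored))
```

A single dropped character changes the signature by exactly one
occurrence, so typo-like variants pass the filter while unrelated
titles are rejected before any string comparison takes place.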
P.S. The above CLI sequence can serve as inspiration for
regression/functional tests covering these typo-like cases.
--
Ticket URL: <http://invenio-software.org/ticket/183>
Invenio <http://invenio-software.org>