#183: BibMatch: more permissive fuzzy matching
-------------------------+--------------------------------------------------
 Reporter:  simko        |       Owner:  jlavik
     Type:  enhancement  |      Status:  new
 Priority:  major        |   Milestone:
Component:  BibMatch     |     Version:
 Keywords:               |
-------------------------+--------------------------------------------------
 BibMatch fuzzy matching should be made more permissive.  For example,
 download the MARCXML file for record 92, change `Topological` into
 `Toological` (leaving out one character to simulate a typo), and try to
 match:
 {{{
 $ wget -O /tmp/z.xml 'http://pcuds33.cern.ch/record/92/export/xm'
 $ sed -i 's/Topological/Toological/' /tmp/z.xml
 $ bibmatch --field=245__a --mode=a < /tmp/z.xml
 }}}

 The record is not fuzzy-matched, although it should be.

 There are several techniques we could use in such cases, e.g.
 `compare_strings()` from BibMerge's differ.  To keep matching efficient,
 though, we may need to pre-store some markers for the fields that are most
 useful for matching, even simplistic ones such as counts of how many times
 each letter occurs in the field, or something of the kind.

 P.S. The above CLI sequence can serve as inspiration for
 regression/functional tests for these typo-like cases.

-- 
Ticket URL: <http://invenio-software.org/ticket/183>
Invenio <http://invenio-software.org>
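
 The letter-count idea from this ticket could look roughly like the sketch
 below.  It is not BibMatch code: the function names, the `stored` mapping,
 the thresholds and the sample titles are all hypothetical, and Python's
 standard `difflib.SequenceMatcher` merely stands in for a real comparer such
 as BibMerge's `compare_strings()`, whose interface is not assumed here.  The
 pre-stored letter counts act as a cheap filter, so that the more expensive
 string comparison only runs on plausible candidates.

```python
from collections import Counter
from difflib import SequenceMatcher


def letter_signature(text):
    """Count how often each alphanumeric character occurs (case-insensitive)."""
    return Counter(ch for ch in text.lower() if ch.isalnum())


def signature_distance(sig_a, sig_b):
    """Sum of absolute per-character count differences between two signatures."""
    return sum(abs(sig_a[ch] - sig_b[ch]) for ch in set(sig_a) | set(sig_b))


def fuzzy_candidates(query, stored, max_distance=2, min_ratio=0.9):
    """Return (record_id, ratio) pairs for stored values that fuzzily match query.

    `stored` is a hypothetical mapping of record id -> (value, signature),
    where each signature was computed once, ahead of time.
    """
    query_sig = letter_signature(query)
    matches = []
    for recid, (value, sig) in stored.items():
        # Cheap pre-filter using only the pre-stored markers.
        if signature_distance(query_sig, sig) > max_distance:
            continue
        # More expensive confirmation; difflib is a stand-in comparer.
        ratio = SequenceMatcher(None, query.lower(), value.lower()).ratio()
        if ratio >= min_ratio:
            matches.append((recid, ratio))
    # Best match first.
    return sorted(matches, key=lambda m: m[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical titles; the real content of record 92 is not assumed.
    titles = {92: "Topological field theory", 93: "Quantum chromodynamics"}
    stored = {rid: (t, letter_signature(t)) for rid, t in titles.items()}
    print(fuzzy_candidates("Toological field theory", stored))
```

 A single dropped character changes the signature distance by exactly one,
 so a small `max_distance` keeps typo-like variants while discarding
 unrelated values early.  Note that anagrams share a signature, which is why
 the second, order-aware comparison is still needed.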