#549: BibMatch: match validation
--------------------------------+--------------------
Reporter: jlavik | Owner: jlavik
Type: enhancement | Status: new
Priority: major | Milestone:
Component: BibMatch | Version:
Keywords: matching, workflow |
--------------------------------+--------------------
Currently, BibMatch blindly accepts a search result as an exact match if
the search returns only one hit. This can produce false positives. To
get more reliable results, such matches could be compared (validated)
against the original record to filter out possible mismatches. The same
technique could also be applied to fuzzy/ambiguous matches to filter out
wrong candidates, thus reducing the amount of human interaction needed to
approve matches.
This validation process can then involve comparing record fields based on
rule-sets defined by users and also on some detection metrics. For
example, when comparing theses we want to be strict about which author is
present in the first-author field, while some papers mix the first author
with co-authors, so the comparison can be more liberal. In other cases the
title may be very ambiguous and one would want to be extra forgiving
there.
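A minimal sketch of such a field-by-field validation, assuming records are
plain dicts and a rule-set maps a field name to a minimum similarity. The
names validate_match, similarity and the field keys are hypothetical, and
difflib's ratio is only a stand-in for whatever metric BibMatch would use:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def validate_match(original, candidate, rules):
    """Return True if every field named in `rules` passes its threshold.

    `rules` maps a field name to the minimum similarity required.
    Assumption: a field missing from either record fails validation.
    """
    for field, min_ratio in rules.items():
        if field not in original or field not in candidate:
            return False
        if similarity(original[field], candidate[field]) < min_ratio:
            return False
    return True


# Hypothetical rule-set for theses: exact first author, forgiving title.
thesis_rules = {"first_author": 1.0, "title": 0.5}
```

With these rules a candidate whose first author differs at all is
rejected, while a title with small wording differences still passes.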
Regular expressions could be used to detect types of records and the
appropriate rule-set would then be applied, complementing/overwriting a
general-purpose rule-set for the specific fields.
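One possible way to sketch this selection, assuming the record is available
as text and that thresholds are keyed by field tag. The patterns, the tags
used as keys and the names DEFAULT_RULES, RULESETS and select_rules are
illustrative assumptions, not an existing BibMatch interface:

```python
import re

# Hypothetical general-purpose rule-set: field tag -> minimum similarity.
DEFAULT_RULES = {"245__a": 0.8, "100__a": 0.8}

# Hypothetical (pattern, overrides) pairs. A pattern detects a record type
# and its overrides complement or overwrite the general-purpose rules.
RULESETS = [
    (re.compile(r"980__a:\s*THESIS", re.IGNORECASE), {"100__a": 0.95}),
    (re.compile(r"980__a:\s*PREPRINT", re.IGNORECASE), {"245__a": 0.6}),
]


def select_rules(record_text):
    """Start from the defaults, then apply every matching override in order."""
    rules = dict(DEFAULT_RULES)
    for pattern, overrides in RULESETS:
        if pattern.search(record_text):
            rules.update(overrides)
    return rules
```

Copying the defaults first keeps the general-purpose rule-set untouched, so
a record matching no pattern simply falls back to it.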
The comparison metrics can be as simple as a threshold on a normalized
Levenshtein distance, but can also cover how much ordering matters and
whether fields are missing. Invenio already has some implementations of
record-differs in BibEdit and BibMerge which might be worth looking at.
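As a rough sketch of the simplest metric, the following computes a
Levenshtein distance normalized by the longer string and compares it to a
threshold. The function names and the default threshold of 0.2 are
assumptions chosen for illustration:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the row over the shorter string
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty values are treated as identical
    return levenshtein(a, b) / longest


def fields_match(original, candidate, threshold=0.2):
    """True if the normalized distance is at or below the threshold."""
    a = original.strip().lower()
    b = candidate.strip().lower()
    return normalized_distance(a, b) <= threshold
```

A lower threshold makes the comparison stricter. Ordering and missing
fields are not covered here and would need separate checks.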
As one can imagine, these rule-sets can be quite flexible and as detailed
as one needs them to be. The specification of such rule-sets can be left
to super-users, to accommodate their users' needs, somewhere in
invenio-local.conf or a standalone configuration file.
--
Ticket URL: <http://invenio-software.org/ticket/549>
Invenio <http://invenio-software.org>