#549: BibMatch: match validation
--------------------------------+--------------------
 Reporter:  jlavik              |      Owner:  jlavik
     Type:  enhancement         |     Status:  new
 Priority:  major               |  Milestone:
Component:  BibMatch            |    Version:
 Keywords:  matching, workflow  |
--------------------------------+--------------------
 Currently, BibMatch blindly accepts a search result as an exact match if
 the search returns only one hit. This can cause false positives. To
 produce more reliable results, these matches could be compared (validated)
 against the original record to filter out any mismatches. The same
 technique could also be applied to fuzzy/ambiguous matches to filter out
 wrong candidates, thus reducing the amount of human interaction needed to
 approve matches.

 This validation process can involve comparing record fields based on
 rule-sets defined by users, and also on some detection metrics. For
 example, when comparing theses we want to be strict about which author is
 present in the first-author field, whereas some papers mix the first
 author with the co-authors, so the comparison can be more liberal there.
 In other cases the title may be very ambiguous and one would want to be
 extra forgiving.

 Regular expressions could be used to detect the type of a record, and the
 appropriate rule-set would then be applied, complementing or overriding a
 general-purpose rule-set for the affected fields.
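
 A minimal sketch of how such rule-set selection might look. All names
 here (DEFAULT_RULES, TYPE_RULES, select_rules, the field names and the
 threshold values) are hypothetical and only illustrate the idea; nothing
 like this exists in BibMatch yet.

```python
import re

# Hypothetical general-purpose rule-set: field name -> comparison settings.
DEFAULT_RULES = {
    "title": {"threshold": 0.8},
    "first_author": {"threshold": 0.8},
}

# Hypothetical type-specific overrides. Each pattern is searched for in a
# textual form of the record; matching entries override the defaults.
TYPE_RULES = [
    (re.compile(r"\b(thesis|dissertation)\b", re.IGNORECASE),
     {"first_author": {"threshold": 0.95}}),
]


def select_rules(record_text):
    """Return the default rules, overridden by every matching type rule."""
    # Copy the per-field settings so the defaults are never modified.
    rules = dict((field, dict(settings))
                 for field, settings in DEFAULT_RULES.items())
    for pattern, overrides in TYPE_RULES:
        if pattern.search(record_text):
            for field, settings in overrides.items():
                rules.setdefault(field, {}).update(settings)
    return rules
```

 With these assumed values, a record detected as a thesis gets a stricter
 first-author threshold while keeping the general-purpose title rule.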

 The comparison metrics can simply be a threshold on a normalized
 Levenshtein distance, but can also cover how much ordering matters and
 whether fields are missing. Invenio already has some implementations of
 record-differs in BibEdit and BibMerge which might be worth looking at.
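
 As an illustration of the simplest metric, here is a sketch of a
 normalized Levenshtein comparison with a threshold. The function names,
 the lower-casing/stripping step and the default threshold of 0.8 are
 assumptions made for this example, not existing Invenio code.

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / float(longest)


def fields_match(original, candidate, threshold=0.8):
    """Accept the candidate if the normalized similarity reaches threshold."""
    return similarity(original.lower().strip(),
                      candidate.lower().strip()) >= threshold
```

 A per-field threshold taken from the active rule-set could be passed as
 the threshold argument, so strict fields reject near-misses that lenient
 fields would accept.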

 As one can begin to imagine, these rule-sets can be quite flexible and as
 detailed as one needs them to be. Their specification can be left to
 super-users, to accommodate their users' needs, somewhere in
 invenio-local.conf or a standalone configuration file.

-- 
Ticket URL: <http://invenio-software.org/ticket/549>
Invenio <http://invenio-software.org>
