#556: Journal indexing/citation extraction RFC
---------------------+-----------------
 Reporter:  tbrooks  |      Owner:
     Type:  defect   |     Status:  new
 Priority:  major    |  Milestone:
Component:  BibRank  |    Version:
 Keywords:  INSPIRE  |
---------------------+-----------------
 I went to change something smallish about journal indexing in INSPIRE and
 noticed several things that I think are worth changing.  Please comment
 on my small proposal:
 Problem: I want to be able to index journal publication information in
 multiple ways, and I want the citation indexer to be able to try searching
 in multiple formats to find a matching journal.  In particular, making the
 year an optional fourth field in the reference format would be nice.


 Motivation:  Chris' refextract generates citations of the form p,v,c,y
 for INSPIRE.  This does not match what bibrank and bibindex are doing,
 so these references are not correctly associated with their papers.  The
 simple solution is for refextract to use "pubinfo_journal_format" from
 bibrank/citation.cfg and thus to drop the year.  However, adding the
 year, in cases where it can be determined, is a good thing, since there
 are journals whose volume numbers are ambiguous and need the year to
 identify a unique paper (JHEP, most notably).

 Additionally, it would be nice to be able to consume our own output, so
 the forms that we write out, such as p v (y) c or similar, should also
 be available as indexed options.

 Finally, note that bibindex contains:

 {{{
 elif CFG_INSPIRE_SITE:
     CFG_JOURNAL_TAG = '773__%'
     CFG_JOURNAL_PUBINFO_STANDARD_FORM = "773__p,773__v,773__c"
 }}}


 This should really be read from citation.cfg (or vice versa, since the
 two must be the same for citation ranking and searching to work).


 Proposed Solution:

 1) Remove the "journal standard form" from both config files and move it
 to the top-level invenio.conf (perhaps allowing a fallback or override
 from citation.cfg).

 2) Allow CFG_JOURNAL_FORM to contain a list of several journal forms, e.g.
 ['p v (y) c', 'p,v,c']

 bibindex_engine will index all of the forms as phrases.

 bibrank_citation_indexer.ref_analyzer will try all possible forms, in
 order, until it finds a unique match for a reference.
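
 To make the intended lookup order concrete, here is a minimal sketch of
 "try each form in order until one gives a unique match".  It is not the
 existing ref_analyzer code: find_unique_match is a hypothetical helper,
 and a plain dict mapping indexed phrases to record IDs stands in for the
 real journal index.

```python
def find_unique_match(candidates, index):
    """Return the record ID for the first candidate phrase that matches
    exactly one record, or None if no form gives a unique match.

    candidates: the same reference rendered in each journal form, in
                preference order (hypothetical input)
    index:      dict mapping indexed phrase -> set of record IDs
                (a stand-in for the real journal index)
    """
    for phrase in candidates:
        recids = index.get(phrase, set())
        if len(recids) == 1:
            return next(iter(recids))
    return None


# Toy index: the year-less form is ambiguous, the form with the year is not.
toy_index = {
    'JHEP,0412,015': {101, 202},
    'JHEP 0412 (2004) 015': {101},
}

match = find_unique_match(['JHEP 0412 (2004) 015', 'JHEP,0412,015'], toy_index)
```

 With the toy data above, the form that carries the year resolves to a
 single record, while the year-less form alone would stay ambiguous and
 return None.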

 refextract will use the first form in the list that it has sufficient
 information to create.
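
 A rough sketch of how refextract could pick "the first form it has
 sufficient information to create" follows.  The names render_form and
 first_renderable are made up for illustration, the CFG_JOURNAL_FORM value
 is only the example list from this proposal, and the sketch assumes each
 lowercase letter in a form stands for one 773 subfield (p, v, y, c).

```python
import re

# Example value of the proposed setting, copied from this ticket.
CFG_JOURNAL_FORM = ['p v (y) c', 'p,v,c']


def render_form(form, pubinfo):
    """Fill in one form from a dict of subfield values.

    Returns None when any subfield the form needs is missing or empty.
    Assumes every lowercase letter in the form is a subfield code.
    """
    needed = re.findall(r'[a-z]', form)
    if any(not pubinfo.get(code) for code in needed):
        return None
    # A function replacement inserts the value literally (no escape handling).
    return re.sub(r'[a-z]', lambda m: pubinfo[m.group(0)], form)


def first_renderable(forms, pubinfo):
    """Return the first form that can be fully rendered, or None."""
    for form in forms:
        rendered = render_form(form, pubinfo)
        if rendered is not None:
            return rendered
    return None


with_year = first_renderable(
    CFG_JOURNAL_FORM, {'p': 'JHEP', 'v': '0412', 'y': '2004', 'c': '015'})
without_year = first_renderable(
    CFG_JOURNAL_FORM, {'p': 'JHEP', 'v': '0412', 'c': '015'})
```

 When the year is known the first form is used; when it is not, the code
 falls back to the year-less form, which is the migration path described
 in the next paragraph.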

 This should allow INSPIRE to migrate towards a form of references that
 includes the year, even though we don't have year information for all
 references.  More generally, it allows reference strings to be built
 from various pieces of publication information, and it keeps citation
 ranking and searching in sync with each other through a single
 configuration variable.


 Please comment.  Volunteers to implement this are welcome, though it will
 probably be Travis...

-- 
Ticket URL: <http://invenio-software.org/ticket/556>
Invenio <http://invenio-software.org>
