#556: Journal indexing/ citation extraction RFC
---------------------+-----------------
Reporter: tbrooks | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: BibRank | Version:
Keywords: INSPIRE |
---------------------+-----------------
I went to change something small in INSPIRE's journal indexing and noticed
several things that I think are worth changing. Please comment on this
small proposal:
Problem: I want to be able to index journal publication information in
multiple ways, and I want the citation indexer to be able to try searching
in multiple formats to find a matching journal. In particular, making the
year an optional fourth field in the reference format would be nice.
Motivation: Chris' refextract generates citations of the form p,v,c,y
for INSPIRE. This does not match what bibrank and bibindex are doing, so
these references are not correctly associated with their papers. The
simple solution is for refextract to use "pubinfo_journal_format" from
bibrank/citation.cfg and thus drop the year. However, including the
year when it can be determined is a good thing, since some journals
have ambiguous volume numbers and need the year to identify a unique
paper (most notably JHEP).
Additionally, it would be nice to be able to consume our own output: the
form that we write out in the formats, p v (y) c or similar, should also
be available as an indexed option.
Finally, note that bibindex contains:
{{{
elif CFG_INSPIRE_SITE:
    CFG_JOURNAL_TAG = '773__%'
    CFG_JOURNAL_PUBINFO_STANDARD_FORM = "773__p,773__v,773__c"
}}}
This should really be read from citation.cfg (or vice versa), since the
two must be the same for citation ranking and searching to work.
Proposed Solution:
1) Remove the "journal standard form" from both config files and move it
to the top-level invenio.conf (perhaps allowing a fallback or override
from citation.cfg).
2) Allow CFG_JOURNAL_FORM to contain a list of several journal forms, e.g.
['p v (y) c', 'p,v,c']. Then:
 - bibindex_engine will index all of the forms as phrases.
 - bibrank_citation_indexer.ref_analyzer will try all possible forms, in
   order, until it finds a unique match for a reference.
 - refextract will use the first form in the list that it has sufficient
   information to create.
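
To make point 2 concrete, here is a rough sketch of how a list of forms
might be expanded. It is only an illustration, not existing Invenio code:
render_form, all_forms, first_form and SUBFIELD_CODES are hypothetical
names, the letters p, v, y, c are assumed to stand for the 773 subfields
(journal title, volume, year, pages), and the sample values are made up.

```python
import re

# Proposed config value: journal forms in order of preference.
CFG_JOURNAL_FORM = ['p v (y) c', 'p,v,c']

# Assumption: each of these letters in a form is a 773 subfield placeholder.
SUBFIELD_CODES = 'pvyc'


def render_form(form, pubinfo):
    """Fill one form from a dict mapping subfield code to string value.

    Returns None if any subfield the form needs is missing or empty.
    """
    needed = [ch for ch in form if ch in SUBFIELD_CODES]
    if not all(pubinfo.get(code) for code in needed):
        return None
    # Single pass over the template, so substituted values are not rescanned.
    return re.sub('[%s]' % SUBFIELD_CODES,
                  lambda m: pubinfo[m.group(0)], form)


def all_forms(pubinfo):
    """Every form that can be rendered (what the indexer would store)."""
    rendered = (render_form(form, pubinfo) for form in CFG_JOURNAL_FORM)
    return [r for r in rendered if r is not None]


def first_form(pubinfo):
    """First form that can be rendered (what refextract would emit)."""
    forms = all_forms(pubinfo)
    return forms[0] if forms else None


with_year = {'p': 'J.Example', 'v': '12', 'y': '2009', 'c': '034'}
without_year = {'p': 'J.Example', 'v': '12', 'c': '034'}

print(all_forms(with_year))      # both forms
print(first_form(without_year))  # falls back to the p,v,c form
```

A reference without a year simply skips the forms that need one, which is
what would let year-bearing and year-less references coexist.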
This should allow INSPIRE to migrate towards a form of references that
includes the year, even though we don't have year information for all
references. More generally, it allows reference strings to be constructed
from various pieces of a publication, and ensures that citation ranking
and searching stay in sync with each other through a single configuration
variable.
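
As an illustration of the "try all forms, in order, until a unique match"
lookup proposed for ref_analyzer, a minimal sketch follows. The function
name find_unique_match and the toy index (a dict from indexed phrase to a
set of record IDs) are assumptions made for this example; the real indexer
would query the phrase index instead.

```python
def find_unique_match(candidates, index):
    """Return the record ID for the first candidate phrase that matches
    exactly one record, or None if no candidate is unambiguous."""
    for phrase in candidates:
        recids = index.get(phrase, set())
        if len(recids) == 1:
            return next(iter(recids))
    return None


# Toy index with made-up data: two papers share journal, volume and page
# but differ in year, so only the form with the year is unambiguous.
toy_index = {
    'J.Example 12 (2009) 034': {101},
    'J.Example 12 (2010) 034': {102},
    'J.Example,12,034': {101, 102},
}

# Reference with a year: the first form already gives a unique match.
print(find_unique_match(
    ['J.Example 12 (2009) 034', 'J.Example,12,034'], toy_index))

# Reference without a year: only p,v,c is available and it is ambiguous.
print(find_unique_match(['J.Example,12,034'], toy_index))
```

The second call returns None rather than guessing, which matches the point
above that the year is what disambiguates such references.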
Please comment. Volunteers to implement this are welcome, though it will
probably be Travis...
--
Ticket URL: <http://invenio-software.org/ticket/556>
Invenio <http://invenio-software.org>