Hello Tibor,
Tibor Simko <[email protected]> wrote:
>
> On Mon, 25 Feb 2013, Ferran Jorba wrote:
>> we are (slowly) approaching our 1.1 migrations and we'd like to know how
>> to proceed, specially because some virtual collections depend on
>> understanding this behaviour.
>
> The behaviour you mentioned is caused by fuzzy author name tokenisation
> that was introduced in v1.0 already. In short, an author name is now
> indexed in two ways: (i) with a comma, following surname-firstname
> order; (ii) without a comma, following firstname-surname order. This
> attempts to mimic user expectations: usually you say `John Doe' in
> typical colloquial English or `Doe, John' in the library world. The
> former not using comma, the latter using comma.
[long and detailed explanation omitted; thanks a lot!]
Ok, now I understand. This approach is probably great for persons as
authors (say, 100 and 700 fields), but how is it applied to persons as
subjects (600 fields)? This may not be as common in science as in the
humanities, but it exists there too.
I've just seen that Invenio decides to use this tokenizer depending on
the index name (in bibindex_engine.py):
    if index_name in ('author', 'firstauthor'):
        fnc_get_phrases_from_phrase = get_fuzzy_authors_from_phrase
    elif index_name in ('exactauthor', 'exactfirstauthor'):
        fnc_get_phrases_from_phrase = get_exact_authors_from_phrase
    else:
        fnc_get_phrases_from_phrase = get_phrases_from_phrase
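
A rough sketch of what this dispatch seems to imply for a name stored in a
600 field. The three helper functions below are invented stand-ins, not the
real tokenizers in bibindex_engine.py, and `subject` is only a hypothetical
index name. The fuzzy stand-in follows the two-way indexing described above
(with a comma in surname-firstname order, without a comma in
firstname-surname order) and naively treats the last word as the surname.

```python
def fuzzy_author_phrases(name):
    """Toy stand-in for the fuzzy tokenizer: return both spellings."""
    name = " ".join(name.split())  # collapse repeated whitespace
    if "," in name:
        # "Doe, John": surname comes before the comma
        surname, _, firstname = name.partition(",")
        surname, firstname = surname.strip(), firstname.strip()
    else:
        # "John Doe": assumption, the last word is the surname
        firstname, _, surname = name.rpartition(" ")
    if not firstname or not surname:
        return [name]  # single-word name, nothing to reorder
    return ["%s, %s" % (surname, firstname), "%s %s" % (firstname, surname)]


def exact_author_phrases(name):
    """Toy stand-in for the exact tokenizer: keep the name as given."""
    return [" ".join(name.split())]


def plain_phrases(phrase):
    """Toy stand-in for the default phrase tokenizer."""
    return [" ".join(phrase.split())]


def pick_tokenizer(index_name):
    """Same branching as the snippet quoted from bibindex_engine.py."""
    if index_name in ('author', 'firstauthor'):
        return fuzzy_author_phrases
    elif index_name in ('exactauthor', 'exactfirstauthor'):
        return exact_author_phrases
    else:
        return plain_phrases


# The same name ends up indexed differently depending on the index name.
for index_name in ('author', 'exactauthor', 'subject'):
    print(index_name, pick_tokenizer(index_name)("Doe, John"))
```

If this reading is right, a person name is stored in both orders only in
the `author' and `firstauthor' indexes; in any other index it would be
stored in a single form.
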
In Traces, our Catalan Language and Literature Database
(http://traces.uab.cat/), the admins routinely create virtual
collections using author names, with the goal of collecting records of
those persons both as authors and as subjects. The name of the person is
enclosed in double quotes, like "Last, First", but without specifying any
field. We'll review those definitions under this new understanding.
> P.S. See also `exactauthor' index that does not use fuzzy tokeniser but
> attempts to match author names in an exact manner. It may be
> better suited to your virtual collection definition needs.
As our systems come from a long series of updates and migrations, we
don't have those indexes. We'll take a look at their definitions to see
if they are useful to us.
BTW, is there a search field option equivalent to the ap (alternate
patterns) argument that can be used in a collection definition?
(http://invenio-demo.cern.ch/help/hacking/search-engine-api)
Thanks again,
Ferran