On Mon, 25 Feb 2013, Ferran Jorba wrote:
> we are (slowly) approaching our 1.1 migrations and we'd like to know how
> to proceed, especially because some virtual collections depend on
> understanding this behaviour.

The behaviour you mentioned is caused by the fuzzy author name
tokenisation that was already introduced in v1.0.  In short, an author
name is now indexed in two ways: (i) with a comma, in surname-firstname
order; (ii) without a comma, in firstname-surname order.  This attempts
to mimic user expectations: one usually says `John Doe' in colloquial
English, but `Doe, John' in the library world; the former without a
comma, the latter with one.
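
To make the idea concrete, here is a simplified sketch in plain Python.
It is only an illustration of the principle, not the actual Invenio
code, which additionally generates variants with initials as shown
further below:

   def name_variants(author):
       """Return the two basic orderings of an author name.

       Simplified sketch only; the real Invenio tokeniser also
       produces variants with initials (see the session below).
       """
       if ',' in author:
           surname, firstname = [p.strip() for p in author.split(',', 1)]
       else:
           # no comma given: assume the last word is the surname
           words = author.split()
           surname, firstname = words[-1], ' '.join(words[:-1])
       return ['%s, %s' % (surname, firstname),  # library order, with comma
               '%s %s' % (firstname, surname)]   # colloquial order, no comma

   print(name_variants('Doe, John'))
   # -> ['Doe, John', 'John Doe']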

This behaviour was introduced so that we could distinguish between
different people with almost the same name but with inverted first and
family names.  A classical example in the high-energy physics community
is `Denis BERNARD' vs `Bernard DENIS'.  In French, the surname is often
printed in all caps to help make the distinction.
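
Reusing the name_variants() sketch above (again, just an illustration),
the two spellings produce entirely different variants, so a phrase
search on the comma form matches only the intended person:

   print(name_variants('Bernard, Denis'))   # surname BERNARD
   # -> ['Bernard, Denis', 'Denis Bernard']
   print(name_variants('Denis, Bernard'))   # surname DENIS
   # -> ['Denis, Bernard', 'Bernard Denis']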

Now it is true that in some languages (e.g. Hungarian) the usual
convention has the order inverted, e.g. `Petőfi Sándor', where one puts
the family name before the first name.  This goes contrary to the above
no-Doe-John-without-comma convention.  So we should probably make this
behaviour configurable.

Do you need to distinguish between people typing `Denis BERNARD' and
`Bernard DENIS' in free form in your installations?  That's another
configuration option we can introduce.

                                 * * *

Finally, in order to help you better understand how Invenio tokenises
author names, here is a concrete source code example:

In [1]: from invenio.bibindex_engine_tokenizer import \
        BibIndexFuzzyNameTokenizer

In [2]: t = BibIndexFuzzyNameTokenizer()

In [3]: t.tokenize('Doe, John')
Out[3]: ['Doe, J', 'Doe, John', 'J Doe', 'John Doe']

This means that if your record contains the string `Doe, John' verbatim,
then for the purposes of the author index we create four variants out of
it:

   Doe, J
   Doe, John
   J Doe
   John Doe

You can experiment for yourself to see what we are doing with long
Spanish and/or Portuguese names, for example:

In [4]: t.tokenize('CASULA, Ester Anna Rita')
Out[4]:
['A CASULA',
 'A R CASULA',
 'A Rita CASULA',
 'Anna CASULA',
 'Anna R CASULA',
 'Anna Rita CASULA',
 'CASULA, A',
 'CASULA, A R',
 'CASULA, A Rita',
 'CASULA, Anna',
 'CASULA, Anna R',
 'CASULA, Anna Rita',
 'CASULA, E',
 'CASULA, E A',
 'CASULA, E Anna',
 'CASULA, E Anna Rita',
 'CASULA, E R',
 'CASULA, E Rita',
 'CASULA, Ester',
 'CASULA, Ester A',
 'CASULA, Ester Anna',
 'CASULA, Ester Anna Rita',
 'CASULA, Ester R',
 'CASULA, Ester Rita',
 'CASULA, R',
 'CASULA, Rita',
 'E A CASULA',
 'E Anna CASULA',
 'E Anna Rita CASULA',
 'E CASULA',
 'E R CASULA',
 'E Rita CASULA',
 'Ester A CASULA',
 'Ester Anna CASULA',
 'Ester Anna Rita CASULA',
 'Ester CASULA',
 'Ester R CASULA',
 'Ester Rita CASULA',
 'R CASULA',
 'Rita CASULA']

P.S. See also the `exactauthor' index, which does not use the fuzzy
     tokeniser but attempts to match author names exactly.  It may be
     better suited to your virtual collection definition needs.
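
     For instance, assuming your records store the name in exactly this
     form, a virtual collection could be defined by a search query along
     the lines of (a hypothetical example; adapt it to your own data):

        exactauthor:"CASULA, Ester Anna Rita"

     which matches the stored name string as such, without any of the
     fuzzy variants shown above.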

Best regards
--
Tibor Simko
