On Wed, 27 Feb 2013, Ferran Jorba wrote:
> Ok, now I understand.  Although this approach is probably great for
> persons as authors (say 100 and 700 fields), how is it applied to
> persons as subjects (600 fields)?  This may not be as common in science
> as in humanities, although it exists too.

It is not applied to 600 fields unless you add them to the `author' index.

> I've just seen that Invenio decides to use this tokenizer depending on
> the index name (in bibindex_engine.py):
>
>             if index_name in ('author', 'firstauthor'):
>                 fnc_get_phrases_from_phrase = get_fuzzy_authors_from_phrase

Yes, it is kind of hackish.  Our ultimate goal is to give every site
administrator an easy way to decide which index uses which tokeniser,
which stopword list, which stemming rules, etc.  There is a ticket about
centralising these index configurations:

   <http://invenio-software.org/ticket/852>

but we have not made much progress on it yet.
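
To illustrate what such a centralised configuration could look like,
here is a minimal sketch.  It is not Invenio code: INDEX_CONFIG,
get_tokenizer() and both toy tokenizers are hypothetical names invented
for this example, and `personsubject' stands for a site-specific index
over 600 fields.  The point is only that the tokenizer is looked up per
index instead of being chosen by a hard-coded index name test.

```python
# Hypothetical sketch; none of these names are Invenio APIs.

def tokenize_words(phrase):
    """Default toy tokenizer: lower-case and split on whitespace."""
    return phrase.lower().split()


def tokenize_fuzzy_author(phrase):
    """Toy stand-in for a fuzzy author tokenizer.

    For "Doe, John" it also emits the initial form "Doe, J.".
    """
    phrase = phrase.strip()
    variants = [phrase]
    if "," in phrase:
        surname, forenames = [part.strip() for part in phrase.split(",", 1)]
        if forenames:
            initial_form = "%s, %s." % (surname, forenames[0])
            if initial_form != phrase:
                variants.append(initial_form)
    return variants


# One place where a site administrator would declare per-index settings.
INDEX_CONFIG = {
    "author": {"tokenizer": tokenize_fuzzy_author},
    "firstauthor": {"tokenizer": tokenize_fuzzy_author},
    "personsubject": {"tokenizer": tokenize_fuzzy_author},  # assumed 600 index
}


def get_tokenizer(index_name):
    """Return the configured tokenizer, falling back to the default one."""
    return INDEX_CONFIG.get(index_name, {}).get("tokenizer", tokenize_words)
```

With a table like this, applying the author tokenizer to persons as
subjects would be a one-line configuration change rather than a code
change.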

> As our systems come from a long series of updates and migrations, we
> don't have those indexes.  We'll take a look at their definitions to
> see if they are useful to us.

Alternatively, you can use direct MARC queries like:

   100__a:"Doe, John" OR 700__a:"Doe, John"

This would work just like queries on the `exactauthor' index.
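
If you need many such queries, they can be generated.  The helper below
is only a sketch: marc_author_query() is a hypothetical name, not part
of Invenio, and it assumes the name contains no double quotes.  Passing
`600__a' as an extra tag would cover persons as subjects, assuming your
records keep the personal name in that subfield.

```python
def marc_author_query(name, tags=("100__a", "700__a")):
    """Build an exact-phrase MARC query over the given tags, joined by OR."""
    return " OR ".join('%s:"%s"' % (tag, name) for tag in tags)


# Authors only, as in the query above.
authors_only = marc_author_query("Doe, John")

# Authors plus persons as subjects (600 fields).
with_subjects = marc_author_query("Doe, John", ("100__a", "700__a", "600__a"))
```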

> BTW, is there a search field option for the ap (alternate patterns) to
> be used in a collection definition?
> (http://invenio-demo.cern.ch/help/hacking/search-engine-api)

webcoll uses the mid-level API to calculate which records belong to
which collections, namely search_pattern_parenthesised() with `ap=-9'.
Currently the `ap' parameter is hard-coded; see calculate_reclist() in
`websearch_webcoll.py'.

But I don't think it would be necessary to make it configurable.  You
would be better off using exact queries in your collection definitions,
such as the one I mentioned above:

   100__a:"Doe, John" OR 700__a:"Doe, John"

Provided you can express your collection definitions in this way, you
will also get the bonus of faster ingestion.  This is because these
queries work as soon as a record has been bibupload'ed, even if it has
not been bibindex'ed yet.  So, if bibsched runs `webcoll' before
`bibindex' due to task drift or similar, the new records will still be
spotted and the site updated.

Contrast this with the situation when your collection is defined via
queries like:

   exactauthor:"Doe, John"

Here, a new record matching this criterion would not be found until the
`bibindex' process has fully finished processing the new records.  This
can take some time.

It is a matter of the process chain:

   upload -> index -> webcoll -> record visible

versus:

   upload -> webcoll -> record visible

Depending on how frequently your bibindex/webcoll run and how many
submissions you have per minute, this can make quite a difference.
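
The difference between the two chains can be shown with a toy model.
This is not how Invenio stores or searches records; ToySite and all its
methods are hypothetical and exist only to show that a query on MARC
tags reads the uploaded record directly, while an index query sees
nothing until the indexing step has run.

```python
class ToySite(object):
    """Toy model of the upload -> index -> webcoll chain (not Invenio code)."""

    def __init__(self):
        self.records = {}        # recid -> {tag: [values]}, filled by upload()
        self.author_index = {}   # value -> set of recids, filled by run_index()
        self.unindexed = set()   # recids uploaded but not yet indexed

    def upload(self, recid, fields):
        """Stands in for bibupload: the record is stored immediately."""
        self.records[recid] = fields
        self.unindexed.add(recid)

    def run_index(self):
        """Stands in for bibindex: copy author values into the index."""
        for recid in sorted(self.unindexed):
            for tag in ("100__a", "700__a"):
                for value in self.records[recid].get(tag, []):
                    self.author_index.setdefault(value, set()).add(recid)
        self.unindexed.clear()

    def search_marc(self, tag, value):
        """Query on a MARC tag: reads the stored records directly."""
        return set(recid for recid, fields in self.records.items()
                   if value in fields.get(tag, []))

    def search_exactauthor(self, value):
        """Query on the author index: only sees indexed records."""
        return set(self.author_index.get(value, set()))


site = ToySite()
site.upload(1, {"100__a": ["Doe, John"], "980__a": ["THESIS"]})

found_by_marc_before = site.search_marc("100__a", "Doe, John")
found_by_index_before = site.search_exactauthor("Doe, John")

site.run_index()
found_by_index_after = site.search_exactauthor("Doe, John")
```

In this model the MARC query finds record 1 straight after upload, while
the index query finds it only after run_index() has been called.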

BTW, this is why we generally prefer to define collections via:

   980__a:THESIS

not via:

   collection:THESIS

I've been digressing a bit, but I hope people may find these titbits
useful.

Best regards
--
Tibor Simko
