On Wed, 26 Feb 2014, Alexander Wagner wrote:
> I think this would solve the issue, indeed. I was not aware that I can
> hook up a specific tokenizer to an index. I see in our 1.0 that
> there's some magic happening with authors, but it looked always a bit
> hard coded "just for authors".

Yes, it used to be hard-coded, but we have centralised index
configurations since then.  See for example:

   http://invenio-software.org/ticket/852

In the forthcoming Invenio v1.2, one has:

  mysql> select name,tokenizer from idxINDEX;
  +--------------------+------------------------------+
  | name               | tokenizer                    |
  +--------------------+------------------------------+
  | global             | BibIndexDefaultTokenizer     |
  | collection         | BibIndexDefaultTokenizer     |
  | abstract           | BibIndexDefaultTokenizer     |
  | author             | BibIndexAuthorTokenizer      |
  | keyword            | BibIndexDefaultTokenizer     |
  | reference          | BibIndexDefaultTokenizer     |
  | reportnumber       | BibIndexDefaultTokenizer     |
  | title              | BibIndexDefaultTokenizer     |
  | fulltext           | BibIndexFulltextTokenizer    |
  | year               | BibIndexYearTokenizer        |
  | journal            | BibIndexJournalTokenizer     |
  | collaboration      | BibIndexDefaultTokenizer     |
  | affiliation        | BibIndexDefaultTokenizer     |
  | exactauthor        | BibIndexExactAuthorTokenizer |
  | caption            | BibIndexDefaultTokenizer     |
  | firstauthor        | BibIndexAuthorTokenizer      |
  | exactfirstauthor   | BibIndexExactAuthorTokenizer |
  | authorcount        | BibIndexAuthorCountTokenizer |
  | exacttitle         | BibIndexDefaultTokenizer     |
  | authorityauthor    | BibIndexAuthorTokenizer      |
  | authorityinstitute | BibIndexDefaultTokenizer     |
  | authorityjournal   | BibIndexDefaultTokenizer     |
  | authoritysubject   | BibIndexDefaultTokenizer     |
  | itemcount          | BibIndexItemCountTokenizer   |
  | filetype           | BibIndexFiletypeTokenizer    |
  | miscellaneous      | BibIndexDefaultTokenizer     |
  +--------------------+------------------------------+

> So it would always be an exact match type query, right?

Yes, provided that you don't use values like:

   $0 P:(DE-Juel1)12345 P:(DE-Juel1)678

by mishap or something.  In this case a phrase search could lead to
false positives, unless you use the regexp "/^value$/".  This was one of
my motivations behind the RFC: to point out that if somebody needs
stricter matching, the best option would be to switch to regexp.
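
To illustrate the difference, here is a minimal sketch in plain Python.
It does not call Invenio's search engine: phrase matching is only
approximated by a substring test, and the helper names phrase_match and
strict_match are made up for this example.

```python
import re

# Two identifiers stored in one $0 subfield by mishap.
stored = "P:(DE-Juel1)12345 P:(DE-Juel1)678"
wanted = "P:(DE-Juel1)12345"


def phrase_match(value, field):
    """Loose, phrase-style matching: value occurs somewhere in field."""
    return value in field


def strict_match(value, field):
    """Anchored matching in the spirit of the regexp /^value$/."""
    # re.escape is needed because the value contains parentheses.
    pattern = "^" + re.escape(value) + "$"
    return re.search(pattern, field) is not None
```

With these definitions, phrase_match(wanted, stored) is true (the false
positive), while strict_match(wanted, stored) is false, because the
field holds more than the wanted value.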

> While if I use aid as a logical field I could (somehow) add a
> tokenizer to it's index that tells the indexer: this has to be taken
> literally.

Yes, you can select one of the existing tokenisers via BibIndex Admin Guide,
or if no provided tokeniser suits your needs, you can write a new one
and drop it into "/opt/invenio/lib/python/invenio/bibindex_tokenizers/".
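
As a rough idea of what a "take it literally" tokeniser might look
like, here is a stand-alone sketch.  The class name ExactValueTokenizer
and the method names are assumptions made for illustration; a real
tokeniser would have to follow the interface of the tokenisers shipped
in the bibindex_tokenizers directory, so please check those first.

```python
class ExactValueTokenizer(object):
    """Hypothetical sketch, not an actual Invenio class.

    Emits the whole field value as a single token, so that the index
    can only be matched by the complete value.
    """

    def _exact(self, phrase):
        # Strip surrounding whitespace; do not split on spaces or
        # punctuation.  Empty values produce no tokens.
        value = phrase.strip()
        return [value] if value else []

    # Method names below are assumed, not taken from Invenio.
    def tokenize_for_words(self, phrase):
        return self._exact(phrase)

    def tokenize_for_phrases(self, phrase):
        return self._exact(phrase)
```

The design choice here is simply not to split the value at all, so
"P:(DE-Juel1)12345" stays one token instead of being broken up at the
colon or the parentheses.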

>> For librarian style queries though, there is an "exactauthor" index
>> that behaves stricter here.
>
> Ic. This would, however, then require an explicit "exact"-index for
> all fields that should get the ability for exact searches.

Not necessarily; e.g. for the DOI index, only exact matching makes sense,
hence our "doi" index uses only the "exact" tokeniser, and there is no
need to add another DOI-related index.  You can see how it is (will be)
implemented here:

  http://invenio-software.org/ticket/1655

> Agree. I was just wondering if you want to add something like "search
> those words in this field", and I'd not map this to "" aka phrase
> search. 

Yes, this is akin to not using quotes in our "add-to-search" interface.

Best regards
-- 
Tibor Simko
