On Wed, 26 Feb 2014, Alexander Wagner wrote:
> I think this would solve the issue, indeed. I was not aware that I can
> hook up a specific tokenizer to an index. I see in our 1.0 that
> there's some magic happening with authors, but it looked always a bit
> hard coded "just for authors".

Yes, it used to be hard-coded, but we have centralised index
configurations since then.  See for example:

   http://invenio-software.org/ticket/852

In forthcoming Invenio v1.2, one has:

   mysql> select name,tokenizer from idxINDEX;
   +--------------------+------------------------------+
   | name               | tokenizer                    |
   +--------------------+------------------------------+
   | global             | BibIndexDefaultTokenizer     |
   | collection         | BibIndexDefaultTokenizer     |
   | abstract           | BibIndexDefaultTokenizer     |
   | author             | BibIndexAuthorTokenizer      |
   | keyword            | BibIndexDefaultTokenizer     |
   | reference          | BibIndexDefaultTokenizer     |
   | reportnumber       | BibIndexDefaultTokenizer     |
   | title              | BibIndexDefaultTokenizer     |
   | fulltext           | BibIndexFulltextTokenizer    |
   | year               | BibIndexYearTokenizer        |
   | journal            | BibIndexJournalTokenizer     |
   | collaboration      | BibIndexDefaultTokenizer     |
   | affiliation        | BibIndexDefaultTokenizer     |
   | exactauthor        | BibIndexExactAuthorTokenizer |
   | caption            | BibIndexDefaultTokenizer     |
   | firstauthor        | BibIndexAuthorTokenizer      |
   | exactfirstauthor   | BibIndexExactAuthorTokenizer |
   | authorcount        | BibIndexAuthorCountTokenizer |
   | exacttitle         | BibIndexDefaultTokenizer     |
   | authorityauthor    | BibIndexAuthorTokenizer      |
   | authorityinstitute | BibIndexDefaultTokenizer     |
   | authorityjournal   | BibIndexDefaultTokenizer     |
   | authoritysubject   | BibIndexDefaultTokenizer     |
   | itemcount          | BibIndexItemCountTokenizer   |
   | filetype           | BibIndexFiletypeTokenizer    |
   | miscellaneous      | BibIndexDefaultTokenizer     |
   +--------------------+------------------------------+

> So it would always be an exact match type query, right?

Yes, provided that you don't use values like:

   $0 P:(DE-Juel1)12345 P:(DE-Juel1)678

by mishap or something.  In this case a phrase search could lead to
false positives, unless you use the regexp "/^value$/".  This was one of
my motivations behind the RFC: to point out that if somebody needs
stricter matching, the best would be to switch to regexp.

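To illustrate the false-positive case, here is a minimal sketch in plain
Python.  It does not use Invenio's search engine at all: the helper
names `phrase_match` and `exact_match` and the list `stored_values` are
made up for illustration, and the phrase matcher is only a naive
stand-in for a real phrase search.

```python
import re

# Hypothetical subfield values; the second one holds two identifiers
# by mishap, as in the example above.
stored_values = [
    "P:(DE-Juel1)12345",
    "P:(DE-Juel1)12345 P:(DE-Juel1)678",
]


def phrase_match(query, value):
    """Naive stand-in for a phrase search: the query tokens must occur
    as a consecutive run of whitespace-separated tokens in the value."""
    q = query.split()
    tokens = value.split()
    return any(tokens[i:i + len(q)] == q
               for i in range(len(tokens) - len(q) + 1))


def exact_match(query, value):
    """Anchored regexp match, i.e. the /^value$/ idea: the whole stored
    value must equal the query (metacharacters are escaped)."""
    return re.search("^" + re.escape(query) + "$", value) is not None


query = "P:(DE-Juel1)12345"
phrase_hits = [v for v in stored_values if phrase_match(query, v)]
exact_hits = [v for v in stored_values if exact_match(query, v)]

print(phrase_hits)  # both values, including the malformed one
print(exact_hits)   # only the clean single-identifier value
```

Note the `re.escape` call: identifiers such as "P:(DE-Juel1)12345"
contain parentheses, which would otherwise be read as regexp groups.
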
> While if I use aid as a logical field I could (somehow) add a
> tokenizer to it's index that tells the indexer: this has to be taken
> literally.

Yes, you can select one of the existing tokenisers via BibIndex Admin
Guide, or, if no provided tokeniser suits your needs, you can write a
new one and drop it into
"/opt/invenio/lib/python/invenio/bibindex_tokenizers/".

>> For librarian style queries though, there is an "exactauthor" index
>> that behaves stricter here.
>
> Ic. This would, however, then require an explicit "exact"-index for
> all fields that should get the ability for exact searches.

Not necessarily; e.g. for the DOI index, only exact matching makes
sense, hence our "doi" index uses the "exact" tokeniser only; there is
no need to add another DOI-related index.  You can see how it is (will
be) implemented here:

   http://invenio-software.org/ticket/1655

> Agree. I was just wondering if you want to add something like "search
> those words in this field", and I'd not map this to "" aka phrase
> search.

Yes, this is akin to not using quotes in our "add-to-search" interface.

Best regards
-- 
Tibor Simko

