Dear Ferran:
On Tue, 20 Dec 2016, Ferran Jorba wrote:
> There are two variables in invenio.conf:
> CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS and
> CFG_BIBINDEX_CHARS_PUNCTUATION. If I want to add new chars that
> constitute word separators (for example: ¡ ¿ «», that should act like
> ! ? "), where should they go? For the first ones, it seems that they
> should go with CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, but the
> last with CFG_BIBINDEX_CHARS_PUNCTUATION, but, for some of them (! ")
> are duplicated. Which is the rule?
Roughly, the incoming phrase is first split into "word blocks" according
to punctuation (governed by CFG_BIBINDEX_CHARS_PUNCTUATION), and later
each word block may be split into further "alphanumeric sub-blocks"
according to alphanumeric separators (governed by
CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS).
Here's an example for illustration:
$ ipython
In [1]: from invenio.bibindex_engine_tokenizer import BibIndexWordTokenizer
In [2]: t = BibIndexWordTokenizer()
In [3]: t.tokenize('foo-bar')
Out[3]: ['foo', 'bar', 'foo-bar']
In [4]: t.tokenize('bar?')
Out[4]: ['bar']
In [5]: t.tokenize('bar+')
Out[5]: ['bar+', 'bar']
In this example, you can see the difference between a punctuation
character (`?`) and an alphanumeric separator (`+`): the former is
stripped, while the latter is preserved in the indexed term alongside
the bare word. This makes it possible to index terms such as C++ nicely.
For your use case, it would seem sufficient to add the new characters as
punctuation. You can experiment with the CFG_BIBINDEX_CHARS_PUNCTUATION
and CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS settings in your instance
and then rerun the tokenizer session shown above to see the effects
without having to run a full reindexing.
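If it helps to reason about the two stages before touching the configuration, here is a minimal toy model of the splitting described above. It is only a sketch and not the actual BibIndexWordTokenizer code: the function name toy_tokenize and both character sets are assumptions made up for illustration, so the real settings in your invenio.conf will differ.

```python
import re

# Hypothetical character sets, for illustration only. The real values are
# whatever your invenio.conf defines and are not reproduced here.
TOY_PUNCTUATION = r'[.,:;?!"¡¿«»]'
TOY_ALPHANUMERIC_SEPARATORS = r'[-+/_&]'


def toy_tokenize(phrase,
                 punctuation=TOY_PUNCTUATION,
                 separators=TOY_ALPHANUMERIC_SEPARATORS):
    """Toy two-stage tokenizer; returns a sorted list of unique terms."""
    re_punct_begin = re.compile('^' + punctuation + '+')
    re_punct_end = re.compile(punctuation + '+$')
    re_punct = re.compile(punctuation)
    re_sep = re.compile(separators)

    words = set()
    for block in phrase.lower().split():
        # Stage 1: strip leading/trailing punctuation and keep the word block.
        block = re_punct_begin.sub('', block)
        block = re_punct_end.sub('', block)
        if not block:
            continue
        words.add(block)
        # Split on punctuation that remains inside the block.
        for subblock in re_punct.split(block):
            if not subblock:
                continue
            words.add(subblock)
            # Stage 2: split each sub-block on alphanumeric separators,
            # keeping the unsplit form as well.
            for group in re_sep.split(subblock):
                if group:
                    words.add(group)
    return sorted(words)


if __name__ == '__main__':
    print(toy_tokenize('foo-bar'))        # ['bar', 'foo', 'foo-bar']
    print(toy_tokenize('bar?'))           # ['bar']
    print(toy_tokenize('bar+'))           # ['bar', 'bar+']
    print(toy_tokenize('¡Hola, mundo!'))  # ['hola', 'mundo']
```

In this toy model, ¡ ¿ « » are listed as punctuation, so they are stripped just like ! ? " are. The output is sorted for readability, so its order differs from what the real tokenizer returns.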
Best regards,
--
Tibor Simko