Re: How to add new word separators or punctuation sign to bibindex?

Tibor Simko Mon, 09 Jan 2017 01:13:55 -0800

Dear Ferran:

On Tue, 20 Dec 2016, Ferran Jorba wrote:
> There are two variables in invenio.conf:
> CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS and
> CFG_BIBINDEX_CHARS_PUNCTUATION. If I want to add new chars that
> constitute word separators (for example: ¡ ¿ «», that should act like
> ! ? "), where should they go? For the first ones, it seems that they
> should go with CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS, but the
> last with CFG_BIBINDEX_CHARS_PUNCTUATION, but, for some of them (! ")
> are duplicated. Which is the rule?


Roughly, the incoming phrase is first split into "word blocks" according
to punctuation (governed by CFG_BIBINDEX_CHARS_PUNCTUATION), and later
each word block may be split into further "alphanumeric sub-blocks"
according to alphanumeric separators (governed by
CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS).

Here's an example for illustration:

  $ ipython
  In [1]: from invenio.bibindex_engine_tokenizer import BibIndexWordTokenizer
  In [2]: t = BibIndexWordTokenizer()
  In [3]: t.tokenize('foo-bar')
  Out[3]: ['foo', 'bar', 'foo-bar']
  In [4]: t.tokenize('bar?')
  Out[4]: ['bar']
  In [5]: t.tokenize('bar+')
  Out[5]: ['bar+', 'bar']

In this example, you can see the difference between a punctuation
character (`?`) and an alphanumeric separator (`+`): the former is
simply stripped, while the latter is kept in one token (`bar+`) and
stripped in another (`bar`). This makes it possible to index terms such
as C++ properly.
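
To make the two-stage splitting more tangible, here is a rough
standalone sketch of the idea. It is not the Invenio code: the function
name `toy_tokenize` and the two default character sets are assumptions
made up for illustration, and the real tokenizer does more (and may
return tokens in a different order). It assumes Python 3, where strings
are Unicode.

```python
import re
import string


def toy_tokenize(phrase, punctuation='.,:;?!"', separators=string.punctuation):
    """Toy two-stage tokenizer (hypothetical, not Invenio's implementation).

    Stage 1: split the phrase into word blocks on whitespace and on the
             characters listed in `punctuation`; those characters vanish.
    Stage 2: keep each word block as a token, then also split it on the
             characters listed in `separators` and keep the pieces.
    """
    punct_re = re.compile('[' + r'\s' + re.escape(punctuation) + ']+')
    sep_re = re.compile('[' + re.escape(separators) + ']+')
    tokens = []
    for block in punct_re.split(phrase.lower()):
        # The whole block first, then its alphanumeric sub-blocks.
        for token in [block] + sep_re.split(block):
            if token and token not in tokens:
                tokens.append(token)
    return tokens


print(toy_tokenize('foo-bar'))  # ['foo-bar', 'foo', 'bar']
print(toy_tokenize('bar?'))     # ['bar']
print(toy_tokenize('bar+'))     # ['bar+', 'bar']

# With the assumed defaults, the inverted question mark sticks to the word:
print(toy_tokenize('¿bar?'))    # ['¿bar']

# Treating the new characters as punctuation strips them:
extended = '.,:;?!"' + '¡¿«»'
print(toy_tokenize('¿bar?', punctuation=extended))      # ['bar']
print(toy_tokenize('«foo bar»', punctuation=extended))  # ['foo', 'bar']
```

In this sketch, a character listed only as punctuation disappears from
every token, whereas a character listed only as a separator survives in
the unsplit block.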

For your use case, it would seem sufficient to add the new characters as
punctuation. You can experiment with the CFG_BIBINDEX_CHARS_PUNCTUATION
and CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS settings in your instance
and then use the above Python code to see the effects without having to
run a full reindexing.

Best regards
--
Tibor Simko
