Hello Tibor,
At Mon, 9 Jan 2017 10:12:09 +0100,
Tibor Simko wrote:

[...]
> For your use case, it would seem sufficient to add the new
> characters as punctuation. You can experiment with
> CFG_BIBINDEX_CHARS_PUNCTUATION and
> CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS settings in your instance
> and then use the above Python code to see the effects without having
> to run full reindexing.

Thank you for this clear explanation.  Indeed, adding \«\»\¡\¿ to
CFG_BIBINDEX_CHARS_PUNCTUATION seems to be enough:

In [1]: from invenio.bibindex_engine_tokenizer import BibIndexWordTokenizer
In [2]: t = BibIndexWordTokenizer()
In [3]: t.tokenize('The man said: "Go, run!"')
Out[3]: ['go', 'the', 'said', 'run', 'man']
In [4]: t.tokenize('The man said: «¡Go, run!»')
Out[4]: ['go', 'the', 'said', 'run', 'man']
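
For illustration, the effect can be mimicked outside Invenio with a
minimal sketch.  This is not the BibIndexWordTokenizer code: the
function name `tokenize` and the two punctuation strings are
hypothetical stand-ins, the strings are not read from any instance's
configuration, and the real tokenizer does more than this (for example
it also applies the alphanumeric separators).  The sketch only shows
why characters missing from the punctuation class stay glued to the
words:

```python
import re

# Hypothetical punctuation sets for illustration only; they are NOT read
# from CFG_BIBINDEX_CHARS_PUNCTUATION in an Invenio instance.
PUNCTUATION_BEFORE = '.,:;?!"'
PUNCTUATION_AFTER = PUNCTUATION_BEFORE + '«»¡¿'


def tokenize(phrase, punctuation):
    """Split on whitespace, strip punctuation, lowercase, deduplicate."""
    # Build a character class from the punctuation string.
    strip_re = re.compile('[' + re.escape(punctuation) + ']')
    words = set()
    for word in phrase.split():
        cleaned = strip_re.sub('', word).lower()
        if cleaned:
            words.add(cleaned)
    return words


quoted = 'The man said: «¡Go, run!»'
# Without the new characters, '«¡go' and 'run»' survive as tokens.
print(sorted(tokenize(quoted, PUNCTUATION_BEFORE)))
# With them, the tokens match those of the plain-quoted sentence.
print(sorted(tokenize(quoted, PUNCTUATION_AFTER)))
```

With the extended set, the sketch yields the same five words as the
session above; with the original set, the guillemets and the inverted
exclamation mark remain attached to "go" and "run".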

However, adding them to CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS
does not change this behaviour, which seems OK to us.  In the
coming days and weeks I'll proceed with a massive reindexing.

Best regards,

Ferran
