Hello Tibor,
At Mon, 9 Jan 2017 10:12:09 +0100,
Tibor Simko wrote:
[...]
> For your use case, it would seem sufficient to add the new
> characters as punctuation. You can experiment with
> CFG_BIBINDEX_CHARS_PUNCTUATION and
> CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS settings in your instance
> and then use the above Python code to see the effects without having
> to run full reindexing.
Thank you for this clear explanation. Indeed, adding \«\»\¡\¿ to
CFG_BIBINDEX_CHARS_PUNCTUATION seems to be enough:
In [1]: from invenio.bibindex_engine_tokenizer import BibIndexWordTokenizer
In [2]: t = BibIndexWordTokenizer()
In [3]: t.tokenize('The man said: "Go, run!"')
Out[3]: ['go', 'the', 'said', 'run', 'man']
In [4]: t.tokenize('The man said: «¡Go, run!»')
Out[4]: ['go', 'the', 'said', 'run', 'man']
However, adding them to CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS
does not change this behaviour, which seems OK to us. Over the
coming days and weeks I'll proceed with a massive reindexing.
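For anyone who wants to see the idea without an Invenio instance at hand, here is a rough stand-alone sketch in plain Python 3. It is not the BibIndexWordTokenizer implementation: the two character classes and the tokenize() helper are hypothetical stand-ins that I made up for illustration, and they only mimic the "strip punctuation, then split on whitespace" step.

```python
import re

# Hypothetical stand-ins for a punctuation setting; illustrative values only,
# not the defaults shipped with Invenio.
PUNCTUATION_BASIC = '[.,:;?!"]'
PUNCTUATION_EXTENDED = '[.,:;?!"«»¡¿]'


def tokenize(text, punctuation):
    """Lowercase the text, blank out punctuation, and return sorted unique words."""
    cleaned = re.sub(punctuation, ' ', text.lower())
    return sorted(set(cleaned.split()))


if __name__ == '__main__':
    phrase = 'The man said: «¡Go, run!»'
    # With the basic class the guillemets and inverted marks stay glued to words.
    print(tokenize(phrase, PUNCTUATION_BASIC))
    # With the extended class they are treated like any other punctuation.
    print(tokenize(phrase, PUNCTUATION_EXTENDED))
```

In this simplified model, the extended class yields the same word set for the «¡...!» phrase as for the plain-quoted one, which matches what I observed in the session above.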
Best regards,
Ferran