> I ran into a strange behaviour of the StandardTokenizer.
> Terms containing a '-' are tokenized differently depending
> on the context.
> For example, the term 'nl-lt' is split into 'nl' and 'lt'.
> The term 'nl-lt0' is tokenized into 'nl-lt0'.
> Is this a bug or a feature?
It is designed that way. TypeAttribute of those tokens are different.
> Can I avoid it somehow?
Do you want to split at '-' char no matter what? If yes, you can replace all
'-' characters with whitespace using MappingCharFilter before
StandardTokenizer.
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]