> I ran into a strange behaviour of the StandardTokenizer. > Terms containing a '-' are tokenized differently depending > on the context. > For example, the term 'nl-lt' is split into 'nl' and 'lt'. > The term 'nl-lt0' is tokenized into 'nl-lt0'. > Is this a bug or a feature?
It is designed that way. TypeAttribute of those tokens are different. > Can I avoid it somehow? Do you want to split at '-' char no matter what? If yes, you can replace all '-' characters with whitespace using MappingCharFilter before StandardTokenizer. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org