Hi Anna, what are you using you tokenizer for? There are a lot of different options in lucene an StandardTokenizer is not necessarily the best one. The behaviour you are see is that the tokenizer detects you token as a number. When you look at the grammar that is kind of obvious.
<snip> // floating point, serial, model numbers, ip addresses, etc. // every other segment must have at least one digit NUM = ({ALPHANUM} {P} {HAS_DIGIT} | {HAS_DIGIT} {P} {ALPHANUM} | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+ | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+ | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+ | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+) // punctuation P = ("_"|"-"|"/"|"."|",") </snip> you can either build your own custom filter which fixed only the problem with numbers containing a '- ', use the MappingCharFilter or switch to a different tokenizer. If you could talk more about your usecase you might get better suggestions. Simon On Fri, Jun 18, 2010 at 9:03 AM, Anna Hunecke <annahune...@yahoo.de> wrote: > Hi Ahmet, > thanks for the explanation. :) > okay, so it is recognized as a number? I didn't expect that really. I expect > that all words are either split at the minus or not. > Maybe I'll have to use another tokenizer. > Best, > Anna > > --- Ahmet Arslan <iori...@yahoo.com> schrieb am Do, 17.6.2010: > >> Von: Ahmet Arslan <iori...@yahoo.com> >> Betreff: Re: Strange behaviour of StandardTokenizer >> An: java-user@lucene.apache.org >> Datum: Donnerstag, 17. Juni, 2010 15:50 Uhr >> >> > I ran into a strange behaviour of the >> StandardTokenizer. >> > Terms containing a '-' are tokenized differently >> depending >> > on the context. >> > For example, the term 'nl-lt' is split into 'nl' and >> 'lt'. >> > The term 'nl-lt0' is tokenized into 'nl-lt0'. >> > Is this a bug or a feature? >> >> It is designed that way. TypeAttribute of those tokens are >> different. >> >> > Can I avoid it somehow? >> >> Do you want to split at '-' char no matter what? If yes, >> you can replace all '-' characters with whitespace using >> MappingCharFilter before StandardTokenizer. >> >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org