Re: Strange behaviour of StandardTokenizer

Simon Willnauer Fri, 18 Jun 2010 00:53:00 -0700

Hi Anna,

what are you using your tokenizer for? There are a lot of different
options in Lucene, and StandardTokenizer is not necessarily the best
one. The behaviour you are seeing is that the tokenizer classifies your
token as a number (the NUM rule) because one of its segments contains a
digit. When you look at the grammar, that is kind of obvious.


<snip>
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)

// punctuation
P                = ("_"|"-"|"/"|"."|",")

</snip>
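
To see why 'nl-lt0' stays whole while 'nl-lt' is split, the NUM rule can
be approximated with plain java.util.regex. This is only a sketch: the
ASCII-only character classes and the class name NumRuleSketch are
assumptions for illustration, not the real JFlex definitions (which also
cover Unicode letters), and it checks a whole term rather than doing the
longest-match scanning the generated tokenizer performs.

```java
import java.util.regex.Pattern;

// Rough, ASCII-only approximation of the NUM rule quoted above.
// Hypothetical helper for illustration; not part of Lucene.
class NumRuleSketch {
    // ALPHANUM: one or more letters or digits (ASCII assumed here)
    static final String A = "[A-Za-z0-9]+";
    // HAS_DIGIT: a letter/digit run containing at least one digit
    static final String H = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";
    // P: the punctuation set from the grammar
    static final String P = "[_\\-/.,]";

    // The six alternatives of NUM, in the same order as in the grammar.
    static final Pattern NUM = Pattern.compile(String.join("|",
        A + P + H,
        H + P + A,
        A + "(?:" + P + H + P + A + ")+",
        H + "(?:" + P + A + P + H + ")+",
        A + P + H + "(?:" + P + A + P + H + ")+",
        H + P + A + "(?:" + P + H + P + A + ")+"));

    // True if the whole term matches the approximated NUM rule.
    static boolean isNum(String term) {
        return NUM.matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"nl-lt", "nl-lt0"}) {
            System.out.println(term + " -> "
                + (isNum(term) ? "matches NUM, kept as one token"
                               : "no NUM match, split at the punctuation"));
        }
    }
}
```

'nl-lt0' matches the first alternative (ALPHANUM 'nl', P '-', HAS_DIGIT
'lt0'), while 'nl-lt' has no digit in any segment, so no alternative
applies.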

You can either build your own custom filter that fixes only the
problem with numbers containing a '-', use the MappingCharFilter, or
switch to a different tokenizer.
If you could say more about your use case, you might get better suggestions.
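
The effect of the char-filter option can be sketched without Lucene on
the classpath. In the snippet below, String.replace stands in for a
MappingCharFilter that maps '-' to a space, and a whitespace split
stands in for the tokenizer; the class name DashMappingSketch and both
stand-ins are assumptions for illustration, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Stdlib-only emulation of "map '-' to ' ' before tokenizing".
// Hypothetical helper; a real setup would put a char filter in front
// of the tokenizer instead of rewriting the string by hand.
class DashMappingSketch {
    static List<String> tokenize(String text) {
        // Step 1: the mapping step, every '-' becomes a space.
        String mapped = text.replace('-', ' ');
        // Step 2: stand-in tokenizer, split on runs of whitespace.
        List<String> tokens = new ArrayList<>();
        for (String part : mapped.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("nl-lt"));
        System.out.println(tokenize("nl-lt0"));
    }
}
```

Because the '-' is gone before the tokenizer sees the text, both terms
are split the same way, regardless of whether a segment contains a digit.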

Simon

On Fri, Jun 18, 2010 at 9:03 AM, Anna Hunecke <[email protected]> wrote:
> Hi Ahmet,
> thanks for the explanation. :)
> okay, so it is recognized as a number? I didn't really expect that. I expected
> that all words are either split at the minus or not.
> Maybe I'll have to use another tokenizer.
> Best,
> Anna
>
> --- Ahmet Arslan <[email protected]> wrote on Thu, 17 Jun 2010:
>
>> From: Ahmet Arslan <[email protected]>
>> Subject: Re: Strange behaviour of StandardTokenizer
>> To: [email protected]
>> Date: Thursday, 17 June 2010, 15:50
>>
>> > I ran into a strange behaviour of the StandardTokenizer.
>> > Terms containing a '-' are tokenized differently depending
>> > on the context.
>> > For example, the term 'nl-lt' is split into 'nl' and 'lt'.
>> > The term 'nl-lt0' is tokenized into 'nl-lt0'.
>> > Is this a bug or a feature?
>>
>> It is designed that way. The TypeAttribute values of those tokens
>> are different.
>>
>> > Can I avoid it somehow?
>>
>> Do you want to split at '-' char no matter what? If yes,
>> you can replace all '-' characters with whitespace using
>> MappingCharFilter before StandardTokenizer.
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
>
>
