One of the token patterns defined in StandardTokenizer.jj is this:

  <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
       | <HAS_DIGIT> <P> <ALPHANUM>
       | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
       | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
       | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
        )
  >

So basically, if you have some sequences of characters separated by a <P>
character such as "-", a sequence that contains a digit is combined with
the sequences adjacent to it to form a single token.  In your example only
PP00 contains a digit, so it is joined with its neighbours CA and PROCESS.
WS and YYMM are not adjacent to a sequence that contains a digit, which
explains why they got separated out.  You can alter this behavior with
some simple changes to StandardTokenizer.jj.
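
To make the matching concrete, here is a rough Python sketch of the NUM
rule only.  It is not Lucene code.  The regular expressions for ALPHANUM,
HAS_DIGIT and P are my own ASCII-only approximations of the grammar's
sub-tokens, the other token types in StandardTokenizer.jj are ignored, and
tokenize() is a hypothetical helper that imitates the "longest match wins"
choice a generated tokenizer would make.

```python
import re

# Simplified, ASCII-only stand-ins for the grammar's sub-tokens (assumptions).
ALPHANUM = r"[A-Za-z0-9]+"
HAS_DIGIT = r"[A-Za-z0-9]*[0-9][A-Za-z0-9]*"
P = r"[_\-/.,]"

# The six alternatives of the NUM pattern, in the order quoted above.
NUM_ALTERNATIVES = [re.compile(p) for p in (
    f"{ALPHANUM}{P}{HAS_DIGIT}",
    f"{HAS_DIGIT}{P}{ALPHANUM}",
    f"{ALPHANUM}(?:{P}{HAS_DIGIT}{P}{ALPHANUM})+",
    f"{HAS_DIGIT}(?:{P}{ALPHANUM}{P}{HAS_DIGIT})+",
    f"{ALPHANUM}{P}{HAS_DIGIT}(?:{P}{ALPHANUM}{P}{HAS_DIGIT})+",
    f"{HAS_DIGIT}{P}{ALPHANUM}(?:{P}{HAS_DIGIT}{P}{ALPHANUM})+",
)]
ALPHANUM_RE = re.compile(ALPHANUM)


def tokenize(text):
    """Split text into tokens, preferring the longest match at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        best = ALPHANUM_RE.match(text, pos)
        if best is None:
            pos += 1  # skip separators and anything else that starts no token
            continue
        for pattern in NUM_ALTERNATIVES:
            m = pattern.match(text, pos)
            if m and m.end() > best.end():
                best = m  # a NUM alternative covers more text than plain ALPHANUM
        tokens.append(best.group())
        pos = best.end()
    return tokens


print(tokenize("WS-CA-PP00-PROCESS-YYMM"))
# prints ['WS', 'CA-PP00-PROCESS', 'YYMM']
```

Under these assumptions, WS-CA cannot start a NUM token because neither
part has a digit, so WS is emitted alone.  CA-PP00-PROCESS matches the
third alternative, and the match cannot be extended because YYMM has no
digit either.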

----- Original Message -----
From: "Iain Young" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Tuesday, December 16, 2003 7:46 AM
Subject: RE: Disabling modifiers?


> I think it is a problem with the indexing. I've found another example...
>
> WS-CA-PP00-PROCESS-YYMM
>
> I've looked at the index, and it has been tokenized into 3 words...
>
> WS
> CA-PP00-PROCESS
> YYMM
>
> Looks as though I might have to use a custom tokenizer as well as an
> analyzer then, but any ideas as to why the standard tokenizer would have
> split the variable up like this (i.e. why didn't it split the middle bit,
> only the word off either end)? The only thing I can think of is that there
> are several other variables in the source beginning with WS- or ending
> with -YYMM, so could the tokenizer have seen this and be doing something
> clever
> with them?
>
> Thanks,
> Iain
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

