The expected behavior is to sometimes treat a character as indicating a new
token and other times to ignore the same character?

This sounds like behavior that should be much better documented than it
currently is.

Why would this be the default? What cases is it meant for?

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 29, 2005 10:56 AM
To: java-user@lucene.apache.org
Subject: Re: Inconsistent tokenizing of words containing underscores.

That's StandardAnalyzer's expected behaviour.  If you want
tokenization to occur only on white spaces, use WhitespaceAnalyzer.  If
you want custom behaviour, you should write an Analyzer (there should
be a FAQ entry with an example).
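
To make the two consistent alternatives concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: the class and method names are hypothetical and only illustrate the splitting rules a WhitespaceAnalyzer-style or custom analyzer would apply to the sample words from the original question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
public class UnderscoreTokenizing {

    /** Splits only on whitespace, so underscores stay inside the token. */
    public static List<String> splitOnWhitespace(String text) {
        return split(text, false);
    }

    /** Splits on whitespace and on every underscore. */
    public static List<String> splitOnWhitespaceAndUnderscore(String text) {
        return split(text, true);
    }

    private static List<String> split(String text, boolean underscoreBreaks) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isBreak = Character.isWhitespace(c)
                    || (underscoreBreaks && c == '_');
            if (isBreak) {
                // A break character ends the current token, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the last token when the text does not end with a break.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XYZZZY_DE_SA0001 XYZZZY_AT0001";
        // prints [XYZZZY_DE_SA0001, XYZZZY_AT0001]
        System.out.println(splitOnWhitespace(text));
        // prints [XYZZZY, DE, SA0001, XYZZZY, AT0001]
        System.out.println(splitOnWhitespaceAndUnderscore(text));
    }
}
```

Either rule treats every underscore the same way. Whichever rule is chosen has to be applied both when indexing and when parsing queries, otherwise the indexed tokens and the query tokens will not match.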

Otis

--- "Is, Studcio" <[EMAIL PROTECTED]> wrote:

> Hello,
>  
> I have been using Lucene for a few weeks now in a small project and just
> ran into a problem. My index contains words that contain one or more
> underscores, e.g. XYZZZY_DE_SA0001 or XYZZZY_AT0001. Unfortunately the
> tokenizer splits such a word into multiple tokens at the underscores,
> except at the last underscore.
>  
> For example the word XYZZZY_DE_SA0001 is tokenized as follows:
>  
> 1. Token: XYZZZY
> 2. Token: DE_SA0001
>  
> which is not what I would expect. Either the tokenizer should
> split at every underscore or at none.
>  
> I'm using Lucene 1.4.3 with
> org.apache.lucene.analysis.standard.StandardAnalyzer and Java
> 1.4.2_08.
>  
> Has anybody experienced the same behaviour or can explain it? Could it
> be a bug in the StandardTokenizer?
>  
> Many thanks in advance
>  
> Sebastian Seitz
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


