That's StandardAnalyzer's expected behaviour.  If you want
tokenization to occur only on whitespace, use WhitespaceAnalyzer.  If
you want custom behaviour, you should write your own Analyzer (there
should be a FAQ entry with an example).
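
If you do roll your own, something along these lines should work (an
untested sketch against the 1.4 API; the class and field names are
just examples).  It tokenizes on whitespace only, so the underscores
survive, and the main() method shows how to print the tokens an
analyzer produces:

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class UnderscoreAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Whitespace tokenization keeps underscores intact; the
        // LowerCaseFilter is optional and shown here only to
        // illustrate how filters are chained.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }

    // Quick check: dump the tokens produced for a sample string.
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new UnderscoreAnalyzer();
        TokenStream ts = analyzer.tokenStream("field",
                new StringReader("XYZZZY_DE_SA0001 XYZZZY_AT0001"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.termText());
        }
    }
}

Remember to use the same analyzer for indexing and for QueryParser,
otherwise your queries won't match what's in the index.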

Otis

--- "Is, Studcio" <[EMAIL PROTECTED]> wrote:

> Hello,
>  
> I've been using Lucene in a small project for a few weeks now and
> just ran into a problem. My index contains words with one or more
> underscores, e.g. XYZZZY_DE_SA0001 or XYZZZY_AT0001. Unfortunately
> the tokenizer splits such words into multiple tokens at the
> underscores, except at the last underscore.
>  
> For example the word XYZZZY_DE_SA0001 is tokenized as follows:
>  
> 1. Token: XYZZZY
> 2. Token: DE_SA0001
>  
> which does not conform to my expectations. Either the tokenizer
> should split at every underscore or at none.
>  
> I'm using Lucene 1.4.3 with
> org.apache.lucene.analysis.standard.StandardAnalyzer and Java
> 1.4.2_08.
>  
> Has anybody experienced the same behaviour, or can someone explain
> it? Could it be a bug in the StandardTokenizer?
>  
> Many thanks in advance
>  
> Sebastian Seitz
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
