That's StandardAnalyzer's expected behaviour. If I remember the grammar correctly, StandardTokenizer treats tokens that contain digits specially: a segment like DE_SA0001 matches the rule for things like serial and part numbers, which allows embedded punctuation, while the purely alphabetic XYZZZY does not, so the split happens only at the first underscore. If you want tokenization to occur only on white space, use WhitespaceAnalyzer. If you want custom behaviour, you should write an Analyzer (there should be a FAQ entry with an example).
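As a starting point, something like the following should work against the Lucene 1.4.x API (a rough sketch, not tested; the class name UnderscoreKeepingAnalyzer is just for illustration):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Tokenizes on whitespace only, so underscores stay inside tokens;
    // the LowerCaseFilter mirrors StandardAnalyzer's lowercasing.
    public class UnderscoreKeepingAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new LowerCaseFilter(new WhitespaceTokenizer(reader));
        }
    }

Just make sure you use the same analyzer at both index and search time.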
Otis

--- "Is, Studcio" <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I'm using Lucene for a few weeks now in a small project and just ran
> into a problem. My index contains words that contain one or more
> underscores, e.g. XYZZZY_DE_SA0001 or XYZZZY_AT0001. Unfortunately
> the tokenizer splits the word into multiple tokens at the
> underscores, except the last underscore.
>
> For example, the word XYZZZY_DE_SA0001 is tokenized as follows:
>
> 1. Token: XYZZZY
> 2. Token: DE_SA0001
>
> which is not what I expected. Either the tokenizer should split at
> every underscore or at none.
>
> I'm using Lucene 1.4.3 with
> org.apache.lucene.analysis.standard.StandardAnalyzer and Java
> 1.4.2_08.
>
> Has anybody experienced the same behaviour, or can anyone explain
> it? Could it be a bug in the StandardTokenizer?
>
> Many thanks in advance
>
> Sebastian Seitz