Hi Paul,

There is WordDelimiterFilter, which does exactly what you want. In 3.x it's unfortunately only shipped in the Solr JAR file, but in 4.0 it's in the analyzers-common module.
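To make the target behaviour concrete, here is a minimal plain-Java sketch of the splitting rule from your mail: break a token at non-alphanumeric characters, but keep a token whole if it consists only of such characters. This is not how WordDelimiterFilter is implemented and it uses no Lucene classes; the class name TokenSplitSketch and the method splitOnNonAlphanumeric are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSplitSketch {

    // Hypothetical helper (not part of Lucene): splits one token into parts,
    // treating every run of non-alphanumeric characters as a word boundary.
    public static List<String> splitOnNonAlphanumeric(String token) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Boundary reached: emit the part collected so far
                parts.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString());
        }
        // Assumption taken from the question: a token made up only of
        // non-alphanumeric characters is kept as a single token.
        if (parts.isEmpty() && !token.isEmpty()) {
            parts.add(token);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitOnNonAlphanumeric("this-stuff")); // [this, stuff]
        System.out.println(splitOnNonAlphanumeric("!!!"));        // [!!!]
    }
}
```

Inside a real TokenFilter the extra parts would have to be emitted one per incrementToken() call, which is the bookkeeping WordDelimiterFilter already takes care of.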
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Paul Taylor [mailto:paul_t...@fastmail.fm]
> Sent: Wednesday, November 02, 2011 5:12 PM
> To: java-user@lucene.apache.org
> Subject: Creating additional tokens from input in a token filter
>
> I have a tokenizer filter that takes tokens and then drops any non alphanumeric
> characters
>
> i.e 'this-stuff' becomes 'thisstuff'
>
> but what I actually want it to do is split the one token into multiple tokens using
> the non-alphanumeric characters as word boundaries
>
> i.e 'this-stuff' becomes 'this stuff'
>
> How do I do this ?
>
> thanks Paul
>
> (You may be wondering why I just didn't filter out these characters at the
> tokenizer stage, but I had to keep them in to solve another problem, that is they
> needed to be kept for 'words' that only consisted of non-alphanumeric
> characters)
>
> This is my existing class:
>
> public class MusicbrainzTokenizerFilter extends TokenFilter {
>     /**
>      * Construct filtering <i>in</i>.
>      */
>     public MusicbrainzTokenizerFilter(TokenStream in) {
>         super(in);
>         termAtt = (CharTermAttribute) addAttribute(CharTermAttribute.class);
>         typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
>     }
>
>     private static final String ALPHANUMANDPUNCTUATION =
>         MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];
>
>     // this filters uses attribute type
>     private TypeAttribute typeAtt;
>     private CharTermAttribute termAtt;
>
>     /**
>      * Returns the next token in the stream, or null at EOS.
>      * <p>Removes <tt>'</tt> from the words.
>      * <p>Removes dots from acronyms.
>      */
>     public final boolean incrementToken() throws java.io.IOException {
>         if (!input.incrementToken()) {
>             return false;
>         }
>
>         char[] buffer = termAtt.buffer();
>         final int bufferLength = termAtt.length();
>         final String type = typeAtt.type();
>
>         if (type == ALPHANUMANDPUNCTUATION) { // remove no alpha numerics
>             int upto = 0;
>             for (int i = 0; i < bufferLength; i++) {
>                 char c = buffer[i];
>                 if (!Character.isLetterOrDigit(c)) {
>                     // Do Nothing, (drop the character)
>                 }
>                 else {
>                     buffer[upto++] = c;
>                 }
>             }
>             termAtt.setLength(upto);
>         }
>         return true;
>     }
> }
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org