I have a token filter that takes tokens and drops any non-alphanumeric
characters, e.g. 'this-stuff' becomes 'thisstuff'.
What I actually want is for it to split the one token into multiple
tokens, using the non-alphanumeric characters as word boundaries,
e.g. 'this-stuff' becomes 'this stuff'.
How do I do this?
Thanks, Paul
(You may be wondering why I didn't just filter out these characters at
the tokenizer stage. I had to keep them in to solve another problem:
they need to be preserved for 'words' that consist only of
non-alphanumeric characters.)
This is my existing class:
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class MusicbrainzTokenizerFilter extends TokenFilter {

    private static final String ALPHANUMANDPUNCTUATION =
        MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];

    // This filter uses the token type to decide which tokens to rewrite.
    private final TypeAttribute typeAtt;
    private final CharTermAttribute termAtt;

    /**
     * Construct filtering <i>in</i>.
     */
    public MusicbrainzTokenizerFilter(TokenStream in) {
        super(in);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
    }
    /**
     * Advances to the next token in the stream; returns false at end of stream.
     * <p>Removes non-alphanumeric characters from tokens of type
     * ALPHANUMANDPUNCTUATION.
     */
    @Override
    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }

        final char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();

        if (ALPHANUMANDPUNCTUATION.equals(type)) { // remove non-alphanumerics
            int upto = 0;
            for (int i = 0; i < bufferLength; i++) {
                char c = buffer[i];
                if (Character.isLetterOrDigit(c)) {
                    // Keep the character; anything else is dropped.
                    buffer[upto++] = c;
                }
            }
            termAtt.setLength(upto);
        }
        return true;
    }
}
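
To make the behaviour I am after concrete, here is a minimal sketch of just
the splitting step in plain Java, with no Lucene dependency. TermSplitter and
split are hypothetical names, not part of Lucene or of my existing code.
Keeping a term that contains no letters or digits unchanged is an assumption,
based on the note above about 'words' made up only of non-alphanumeric
characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one term into sub-terms at runs of
// non-alphanumeric characters, e.g. "this-stuff" -> ["this", "stuff"].
final class TermSplitter {

    private TermSplitter() {}

    // Splits the first `length` chars of `buffer`, mirroring the
    // (buffer, length) pair that CharTermAttribute exposes.
    static List<String> split(char[] buffer, int length) {
        List<String> parts = new ArrayList<>();
        int start = -1; // start index of the current alphanumeric run, or -1
        for (int i = 0; i < length; i++) {
            if (Character.isLetterOrDigit(buffer[i])) {
                if (start < 0) {
                    start = i; // a new run begins here
                }
            } else if (start >= 0) {
                // A boundary character ends the current run.
                parts.add(new String(buffer, start, i - start));
                start = -1;
            }
        }
        if (start >= 0) {
            // Flush the run that reaches the end of the term.
            parts.add(new String(buffer, start, length - start));
        }
        if (parts.isEmpty() && length > 0) {
            // Assumption: a term with no letters or digits is kept as-is.
            parts.add(new String(buffer, 0, length));
        }
        return parts;
    }

    // Convenience overload for whole strings.
    static List<String> split(String term) {
        return split(term.toCharArray(), term.length());
    }
}
```

Inside a TokenFilter the parts would presumably have to be buffered and
handed out one per incrementToken() call, with the position increment and
offsets set for each; this sketch does not attempt that part.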