I have a token filter that takes tokens and drops any non-alphanumeric characters,

e.g. 'this-stuff' becomes 'thisstuff'

But what I actually want it to do is split the one token into multiple tokens, using the non-alphanumeric characters as word boundaries,

e.g. 'this-stuff' becomes the two tokens 'this' and 'stuff'
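To make the behaviour I'm after concrete, here is a rough plain-Java sketch of just the splitting step. It uses no Lucene classes, and the names SplitSketch and splitOnNonAlphanumeric are made up for illustration. It does not show how to emit the pieces as separate tokens from a TokenFilter, which is the part I'm stuck on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates the desired splitting only.
class SplitSketch {

    /**
     * Splits the first {@code length} chars of {@code buffer} into runs of
     * letters/digits, treating every other character as a boundary.
     */
    static List<String> splitOnNonAlphanumeric(char[] buffer, int length) {
        List<String> parts = new ArrayList<>();
        int start = -1; // start index of the current alphanumeric run, or -1 if none
        for (int i = 0; i < length; i++) {
            if (Character.isLetterOrDigit(buffer[i])) {
                if (start < 0) {
                    start = i; // a new run begins here
                }
            } else if (start >= 0) {
                // a boundary character closes the current run
                parts.add(new String(buffer, start, i - start));
                start = -1;
            }
        }
        if (start >= 0) {
            // flush the run that reaches the end of the term
            parts.add(new String(buffer, start, length - start));
        }
        return parts;
    }

    public static void main(String[] args) {
        char[] term = "this-stuff".toCharArray();
        System.out.println(splitOnNonAlphanumeric(term, term.length)); // prints [this, stuff]
    }
}
```

A term made up entirely of non-alphanumeric characters produces an empty list here, so those tokens would still need to be passed through untouched.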

How do I do this?

Thanks, Paul

(You may be wondering why I didn't just filter out these characters at the tokenizer stage. I had to keep them to solve another problem: they need to be preserved for 'words' that consist only of non-alphanumeric characters.)

This is my existing class:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class MusicbrainzTokenizerFilter extends TokenFilter {

    private static final String ALPHANUMANDPUNCTUATION
            = MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];

    // this filter uses the token type to decide which tokens to modify
    private final TypeAttribute     typeAtt;
    private final CharTermAttribute termAtt;

    /**
     * Construct filtering <i>in</i>.
     */
    public MusicbrainzTokenizerFilter(TokenStream in) {
        super(in);
        termAtt = addAttribute(CharTermAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
    }

    /**
     * Advances to the next token in the stream; returns false at EOS.
     * <p>Removes all non-alphanumeric characters from tokens of type
     * ALPHANUMANDPUNCTUATION; other tokens pass through unchanged.
     */
    @Override
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }

        final char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();

        // equals() rather than ==, so the check does not rely on both strings being the same instance
        if (ALPHANUMANDPUNCTUATION.equals(type)) { // remove non-alphanumerics
            int upto = 0;
            for (int i = 0; i < bufferLength; i++) {
                final char c = buffer[i];
                if (Character.isLetterOrDigit(c)) {
                    buffer[upto++] = c; // keep; anything else is dropped
                }
            }
            termAtt.setLength(upto);
        }
        return true;
    }
}
