Creating additional tokens from input in a token filter

Paul Taylor Wed, 02 Nov 2011 09:12:39 -0700

I have a token filter that takes each token and drops any non-alphanumeric characters,


i.e. 'this-stuff' becomes 'thisstuff'

but what I actually want it to do is split the one token into multiple tokens, using the non-alphanumeric characters as word boundaries,


i.e. 'this-stuff' becomes 'this stuff'

How do I do this?

Thanks, Paul

(You may be wondering why I didn't just filter out these characters at the tokenizer stage, but I had to keep them in to solve another problem: they needed to be kept for 'words' that consisted only of non-alphanumeric characters.)


This is my existing class:

public class MusicbrainzTokenizerFilter extends TokenFilter {
    /**
     * Construct filtering <i>in</i>.
     */
    public MusicbrainzTokenizerFilter(TokenStream in) {
        super(in);

        termAtt = (CharTermAttribute) addAttribute(CharTermAttribute.class);
        typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
    }

    private static final String ALPHANUMANDPUNCTUATION =
            MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];

    // this filter uses the type attribute
    private TypeAttribute       typeAtt;
    private CharTermAttribute   termAtt;

    /**
     * Advances to the next token in the stream; returns false at EOS.
     * <p>For tokens of type ALPHANUMANDPUNCTUATION, removes every
     * non-alphanumeric character from the term.
     */
    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }

        char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();

        if (type == ALPHANUMANDPUNCTUATION) { // remove non-alphanumerics
            int upto = 0;
            for (int i = 0; i < bufferLength; i++) {
                char c = buffer[i];
                if (!Character.isLetterOrDigit(c) )
                {
                    //Do Nothing, (drop the character)
                }
                else {
                    buffer[upto++] = c;
                }
            }
            termAtt.setLength(upto);
        }
        return true;
    }
}
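
A minimal sketch of the splitting step on its own, in plain Java with no Lucene dependency. TokenSplitter and its split methods are hypothetical names introduced for illustration; they are not part of Lucene or of the class above. It treats each run of letters or digits as one sub-token, and it assumes that a term with no letters or digits at all should be passed through unchanged, as described in the parenthetical note above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Lucene class.
final class TokenSplitter {

    private TokenSplitter() {}

    /**
     * Splits the first {@code length} chars of {@code buffer} into runs of
     * letters/digits, using every other character as a boundary.
     * A term containing no letters or digits is returned whole, so that
     * punctuation-only "words" are kept.
     */
    static List<String> split(char[] buffer, int length) {
        List<String> parts = new ArrayList<String>();
        int start = -1; // start index of the current run, or -1 when between runs
        for (int i = 0; i < length; i++) {
            if (Character.isLetterOrDigit(buffer[i])) {
                if (start < 0) {
                    start = i; // a new run begins here
                }
            } else if (start >= 0) {
                parts.add(new String(buffer, start, i - start)); // close the run
                start = -1;
            }
        }
        if (start >= 0) {
            parts.add(new String(buffer, start, length - start)); // trailing run
        }
        if (parts.isEmpty() && length > 0) {
            parts.add(new String(buffer, 0, length)); // punctuation-only term
        }
        return parts;
    }

    /** Convenience overload for trying the logic on a String. */
    static List<String> split(String term) {
        return split(term.toCharArray(), term.length());
    }
}
```

Inside a TokenFilter, one way to use something like this would be to keep the returned parts in a queue: when the queue is non-empty, incrementToken() writes the next part into termAtt and returns true without calling input.incrementToken(); otherwise it pulls the next input token, splits it, and emits the first part. If the other attributes of the original token need to carry over to each part, AttributeSource's captureState() and restoreState() can be used for that, and a PositionIncrementAttribute controls whether the parts occupy consecutive positions. This is only an outline; offsets would also need adjusting if they matter for highlighting.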
