Re: Creating additional tokens from input in a token filter

Paul Taylor Wed, 02 Nov 2011 13:49:18 -0700

On 02/11/2011 17:15, Uwe Schindler wrote:

Hi Paul,


There is WordDelimiterFilter, which does exactly what you want. In 3.x it's
unfortunately only shipped in the Solr JAR file, but in 4.0 it's in the
analyzers-common module.

Okay, so I found it and it looks very interesting, but it is really overly complex for what I want to do and doesn't handle my specific case. Could anyone possibly give a code example of how I create two tokens from one? Assume I already know how to split it (I can work that bit out).


    public final boolean incrementToken() throws java.io.IOException {
        if (!input.incrementToken()) {
            return false;
        }

        char[] buffer = termAtt.buffer();
        final int bufferLength = termAtt.length();
        final String type = typeAtt.type();

        if (type == ALPHANUMANDPUNCTUATION) {
            int upto = 0;

            for (int i = 0; i < bufferLength; i++) {
                char c = buffer[i];
                if (!Character.isLetterOrDigit(c) )
                {
                    //TODO PUT ALL CHARS AFTER THIS INTO A NEW TOKEN
                }
                else {
                    buffer[upto++] = c;
                }
            }
            termAtt.setLength(upto);
        }
        return true;
    }
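
To make the question concrete, here is a minimal plain-Java sketch of the pattern being asked about: split the token once, return the first part, and hand out the remaining parts on later calls to incrementToken(). It is not Lucene code. The class SplittingFilterSketch, its pending queue and the drain() helper are hypothetical names, an Iterator<String> stands in for the wrapped TokenStream, and a String field stands in for termAtt. The token type check is left out; instead, a token with no letters or digits at all is passed through unchanged, on the assumption that punctuation-only 'words' must survive. A real TokenFilter would presumably also need to deal with offsets and position increments, which this sketch ignores.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Plain-Java stand-in for a splitting token filter; uses no Lucene types. */
class SplittingFilterSketch {
    private final Iterator<String> input;                      // stands in for the wrapped TokenStream
    private final Deque<String> pending = new ArrayDeque<>();  // parts split off but not yet emitted
    private String term;                                       // stands in for termAtt

    SplittingFilterSketch(Iterator<String> input) {
        this.input = input;
    }

    /** Mirrors the incrementToken() contract: true while a token is available. */
    boolean incrementToken() {
        // Emit parts left over from the previous input token first.
        if (!pending.isEmpty()) {
            term = pending.poll();
            return true;
        }
        if (!input.hasNext()) {
            return false;
        }
        String token = input.next();
        splitOnNonAlphanumerics(token, pending);
        // No letters or digits at all: keep the token as it is (assumption, see lead-in).
        term = pending.isEmpty() ? token : pending.poll();
        return true;
    }

    String term() {
        return term;
    }

    /** Appends each maximal run of letters/digits in token to out. */
    static void splitOnNonAlphanumerics(String token, Deque<String> out) {
        int start = -1; // start index of the current run, or -1 when outside a run
        for (int i = 0; i < token.length(); i++) {
            if (Character.isLetterOrDigit(token.charAt(i))) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                out.add(token.substring(start, i));
                start = -1;
            }
        }
        if (start >= 0) {
            out.add(token.substring(start));
        }
    }

    /** Convenience helper: runs the filter over tokens and collects every emitted term. */
    static List<String> drain(List<String> tokens) {
        SplittingFilterSketch filter = new SplittingFilterSketch(tokens.iterator());
        List<String> out = new ArrayList<>();
        while (filter.incrementToken()) {
            out.add(filter.term());
        }
        return out;
    }
}
```

With this sketch, draining the input tokens 'this-stuff', '!!!' and 'plain' yields 'this', 'stuff', '!!!' and 'plain', in that order.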

-----Original Message-----
From: Paul Taylor [mailto:paul_t...@fastmail.fm]
Sent: Wednesday, November 02, 2011 5:12 PM
To: java-user@lucene.apache.org
Subject: Creating additional tokens from input in a token filter

I have a tokenizer filter that takes tokens and then drops any non-alphanumeric
characters, i.e. 'this-stuff' becomes 'thisstuff',

but what I actually want it to do is split the one token into multiple tokens,
using the non-alphanumeric characters as word boundaries,
i.e. 'this-stuff' becomes 'this stuff'.

How do I do this?

thanks Paul

(You may be wondering why I didn't just filter out these characters at the
tokenizer stage, but I had to keep them in to solve another problem: they
needed to be kept for 'words' that consisted only of non-alphanumeric
characters.)

This is my existing class:

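import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class MusicbrainzTokenizerFilter extends TokenFilter {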
      /**
       * Construct filtering <i>in</i>.
       */
      public MusicbrainzTokenizerFilter(TokenStream in) {
          super(in);
          termAtt = (CharTermAttribute) addAttribute(CharTermAttribute.class);
          typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
      }

      private static final String ALPHANUMANDPUNCTUATION =
              MusicbrainzTokenizer.TOKEN_TYPES[MusicbrainzTokenizer.ALPHANUMANDPUNCTUATION];

      // this filter uses the type attribute
      private TypeAttribute       typeAtt;
      private CharTermAttribute   termAtt;

      /**
       * Advances to the next token in the stream; returns false at EOS.
       * <p>Removes non-alphanumeric characters from tokens of type
       * ALPHANUMANDPUNCTUATION.
       */
      public final boolean incrementToken() throws java.io.IOException {
          if (!input.incrementToken()) {
              return false;
          }

          char[] buffer = termAtt.buffer();
          final int bufferLength = termAtt.length();
          final String type = typeAtt.type();

          if (type == ALPHANUMANDPUNCTUATION) {      // remove non-alphanumerics
              int upto = 0;
              for (int i = 0; i < bufferLength; i++) {
                  char c = buffer[i];
                  if (!Character.isLetterOrDigit(c) )
                  {
                      //Do Nothing, (drop the character)
                  }
                  else {
                      buffer[upto++] = c;
                  }
              }
              termAtt.setLength(upto);
          }
          return true;
      }
}

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



