Re: derive tokens from single token

Erik Hatcher Mon, 29 Sep 2003 06:50:03 -0700

On Monday, September 29, 2003, at 09:19 AM, Hackl, Rene wrote:

> I'm looking for a way to implement simultaneous left and right truncation.
>
> The goal is to enable the user to search for e.g. "*hydronaphth*" and find "hexahydronaphthalene" as well as "heptahydronaphthalin".

WildcardQuery, if used directly but not through QueryParser, allows for left and right wildcards like this. Performance is the biggest concern though, as it will have to enumerate all terms in the index to look for matches.

> To achieve that functionality, I'd like to index terms in such a way that from a token "foobar" the tokens "oobar" and "obar" (e.g. minimum word length = 4) would be derived and added to the index. I tried to extend TokenFilter, but all I get is either "oobar" or "obar", depending on when 'return' is called.

This is an interesting approach and would certainly give you the search results you are looking for, except that I'd guess you'll be indexing a ton of terms. If there is some other way to split these words, separating the prefix ("hexa", "hepta") and suffix ("alene", "alin"), it would likely be better. But maybe it's not practical to do so.

But, to the main question....

> How could I add such extra tokens to the tokenStream? Any thoughts on this appreciated.

Yes, this can be done. Craig Walls did this with an example Analyzer in his December '02 Java Developer's Journal article. I adapted his example to demonstrate this in presentations. Here is my code:

import java.io.IOException;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.ResourceBundle;
import java.util.Stack;
import java.util.StringTokenizer;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class AliasFilter extends TokenFilter {
  private static HashMap aliasMap = new HashMap();

  static {
      aliasMap.put("country","yeehaw");
      aliasMap.put("western","rawhide");
  }

  private Stack currentTokenAliases;

  public AliasFilter(TokenStream in) {
    currentTokenAliases = new Stack();
    input = in;
  }

  public Token next() throws IOException {
    if(currentTokenAliases.size() > 0) {
      return (Token)currentTokenAliases.pop();
    }

    Token nextToken = input.next();
    addAliasesToStack(nextToken, currentTokenAliases);

    return nextToken;
  }

  private void addAliasesToStack(Token token, Stack aliasStack) {
    if(token == null) return;

    String aliasString = (String)aliasMap.get(token.termText());

    if(aliasString == null || aliasString.length() < 1) return;

    StringTokenizer tokenizer = new StringTokenizer(aliasString, " ");
    while(tokenizer.hasMoreElements()) {
      String nextAlias = tokenizer.nextToken();
      Token nextTokenAlias = new Token(nextAlias, 0, nextAlias.length());
      aliasStack.push(nextTokenAlias);
    }
  }
}

Note that this is a braindead example of injecting tokens into the stream, so when "country" or "western" is encountered, another token is added to the stream. In your case you'll do something with the initial token and push all its 4+ length pieces onto the stack.
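
For the "4+ length pieces" part, here is a rough sketch of how the suffixes of a term could be computed. It is plain Java with no Lucene dependency, and the class name SuffixTokens, the method name suffixes, and the minLength parameter are made up for illustration; they are not part of Lucene or of the filter above.

```java
import java.util.ArrayList;
import java.util.List;

public class SuffixTokens {

    // Returns every proper suffix of term that is at least minLength
    // characters long, longest first. The term itself is left out because
    // the filter already returns the original token unchanged.
    // (Hypothetical helper, not a Lucene API.)
    public static List<String> suffixes(String term, int minLength) {
        List<String> result = new ArrayList<String>();
        for (int start = 1; term.length() - start >= minLength; start++) {
            result.add(term.substring(start));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [oobar, obar]
        System.out.println(suffixes("foobar", 4));
    }
}
```

Inside something like addAliasesToStack you would then loop over suffixes(token.termText(), 4) and push one new Token per suffix instead of looking anything up in the alias map. With those terms in the index, a search for "*hydronaphth*" can presumably be run as the plain right-truncated query "hydronaphth*", since "hexahydronaphthalene" contributes the suffix "hydronaphthalene".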

Erik

