On Oct 11, 2005, at 10:52 AM, Hugo Lafayette wrote:
Erik Hatcher wrote:


Rather than changing StandardAnalyzer, you could create a custom
Analyzer that is something along the lines of StandardTokenizer  ->
custom apostrophe splitting filter -> ISOLatinFilter.


Why not include that in the FrenchStemFilter's next() method itself?
Would that be bad design?

I've not personally used the FrenchStemFilter, so I cannot comment on its behavior at all. I'm out of my league in that realm.

I'm also somewhat concerned about performance, but it seems to me that your solution would only affect tokens of the "APOSTROPHE" type, so the overhead
would be nonexistent, right?

There is little need to be concerned with analyzer performance, at least at this stage. First have a problem, then optimize for it; I don't like to speculate about performance. But yes, only the apostrophe type (whatever it is exactly; I'm not looking at the code now, but I think it's "<APOSTROPHE>", with angle brackets) would need to be caught and split, and the rest could pass straight through. Again, look at StandardFilter for an example: it removes apostrophes.

You get a special type for words with interior apostrophes from
StandardTokenizer (look at StandardFilter to see how that works). You
could create a simple TokenFilter that splits apostrophe'd tokens
into two.


I'm not sure how to do that efficiently. Is it something like
this?

<code>

private Stack subTokens = new Stack(); // holds the pieces of a split token

public final Token next() throws IOException {
  Token t;
  if (!subTokens.empty()) {
    // A previous call split a token; emit the remaining piece(s) first.
    t = (Token) subTokens.pop();
  } else {
    t = input.next();
    if (t != null && APOSTROPHE_TYPE.equals(t.type())) {
      // tokenizeApostrophe pushes the pieces in reverse order,
      // so pop() yields them in stream order.
      tokenizeApostrophe(t, subTokens);
      t = (Token) subTokens.pop();
    }
  }
  return t;
}

</code>

where "tokenizeApostrophe(Token, Stack)" conditionally splits the
token into two and pushes them onto the stack.

Using a stack (or only a single spare Token, if you will only ever split into two pieces) is a good approach. I haven't tried your code, but I recommend writing some unit tests that exercise your filter separately and ensure it splits tokens as you expect. :)
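The splitting logic under discussion can be sketched in plain Java. This is only an illustration of what the "tokenizeApostrophe" helper might do with a token's text; the method name, the split-on-first-apostrophe rule, and the example word "l'avion" are assumptions for illustration, not the actual Lucene or FrenchStemFilter code. A real TokenFilter would additionally wrap each piece in a new Token with adjusted start/end offsets.

```java
import java.util.ArrayList;
import java.util.List;

public class ApostropheSplit {

    // Hypothetical helper: split a word with an interior apostrophe
    // into its two parts, e.g. "l'avion" -> ["l", "avion"].
    // Words without an interior apostrophe pass through unchanged.
    static List split(String text) {
        List parts = new ArrayList();
        int i = text.indexOf('\'');
        if (i > 0 && i < text.length() - 1) {
            parts.add(text.substring(0, i));
            parts.add(text.substring(i + 1));
        } else {
            parts.add(text); // no interior apostrophe: pass through
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("l'avion")); // prints [l, avion]
        System.out.println(split("avion"));   // prints [avion]
    }
}
```

In the filter above, the pieces would be pushed onto the stack in reverse order so that pop() returns them in their original left-to-right order.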

    Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]