On Oct 11, 2005, at 10:52 AM, Hugo Lafayette wrote:
Erik Hatcher wrote:


Rather than changing StandardAnalyzer, you could create a custom
Analyzer that is something along the lines of StandardTokenizer  ->
custom apostrophe splitting filter -> ISOLatinFilter.


Why not include that in the FrenchStemFilter's next() method itself?
Would that be bad design?

I've not personally used the FrenchStemFilter, so I cannot comment on its behavior at all. I'm out of my league in that realm.

I'm also somewhat concerned about performance, but it seems to me that your solution would only affect tokens of the "APOSTROPHE" type, so the overhead
would be nonexistent, right?

There is little need to be concerned with analyzer performance, at least at this stage. First have a problem, then optimize for it; I don't like to speculate about performance. But yes, only the apostrophe type (whatever it is exactly; I'm not looking at the code now, but I think it's "<APOSTROPHE>", with angle brackets) would need to be caught and split, and the rest could pass straight through. Again, look at StandardFilter for an example: it removes apostrophes.

You get a special type for words with interior apostrophes from
StandardTokenizer (look at StandardFilter to see how that works). You
could create a simple TokenFilter that splits apostrophe'd tokens
into two.


I'm not sure how to do that efficiently. Is it something like
this?

<code>

private Stack subTokens = new Stack(); // holds the pieces of a split token

public final Token next() throws IOException {
  Token t;
  if (!subTokens.empty()) {
    // A previous call split a token; emit the remaining piece(s) first.
    t = (Token) subTokens.pop();
  } else {
    t = input.next();
    if (t != null && APOSTROPHE_TYPE.equals(t.type())) {
      // tokenizeApostrophe pushes the pieces in reverse order,
      // so pop() yields them in stream order.
      tokenizeApostrophe(t, subTokens);
      t = (Token) subTokens.pop();
    }
  }
  return t;
}

</code>

where "tokenizeApostrophe(Token, Stack)" conditionally splits the
token into two and pushes them onto the stack.

Using a stack (or only a single spare Token, if you will only ever split into two pieces) is a good approach. I haven't tried your code, but I recommend writing some unit tests that exercise your filter separately and ensure it splits tokens as you expect. :)
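The splitting logic under discussion can be sketched in plain Java. This is only an illustration of what the "tokenizeApostrophe" helper might do with a token's text; the method name, the split-on-first-apostrophe rule, and the example word "l'avion" are assumptions for illustration, not the actual Lucene or FrenchStemFilter code. A real TokenFilter would additionally wrap each piece in a new Token with adjusted start/end offsets.

```java
import java.util.ArrayList;
import java.util.List;

public class ApostropheSplit {

    // Hypothetical helper: split a word with an interior apostrophe
    // into its two parts, e.g. "l'avion" -> ["l", "avion"].
    // Words without an interior apostrophe pass through unchanged.
    static List split(String text) {
        List parts = new ArrayList();
        int i = text.indexOf('\'');
        if (i > 0 && i < text.length() - 1) {
            parts.add(text.substring(0, i));
            parts.add(text.substring(i + 1));
        } else {
            parts.add(text); // no interior apostrophe: pass through
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("l'avion")); // prints [l, avion]
        System.out.println(split("avion"));   // prints [avion]
    }
}
```

In the filter above, the pieces would be pushed onto the stack in reverse order so that pop() returns them in their original left-to-right order.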

    Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]