Re: Bad behaviors of FrenchAnalyzer

Marvin Humphrey Tue, 11 Oct 2005 11:00:53 -0700


On Oct 11, 2005, at 10:04 AM, Hugo Lafayette wrote:

First of all, and maybe I make a false assumption here, but if you strip leading "j'", "t'" and so on, that means that if you make a search like:
 +text:"il m'aime"

you will get documents with the sentence "il m'aime" (French for "he
loves me") and documents with the sentence "il t'aime" (French for "he
loves you"), which is wrong, right?

I don't speak French, and I can't tell you whether Lingua::Stem::Snowball strips m' and t' -- the docs say "This method strips 's (english) and l', d', ... (french)."

That's a compelling example you have there, though, so I would hope not. Conceptually, I would want the search to focus on the relatively rare word for "love" rather than on the pronouns. However, if the stemmer strips the pronouns, "m'aime" and "t'aime" would be conflated, which is as you say, "wrong". :) Is "aime" ever used in isolation, or is it always hitched to a pronoun?
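
To make the conflation concrete, here is a minimal sketch in Python. It is not how Lingua::Stem::Snowball works internally; the prefix list and the helper names `strip_elision` and `analyze` are assumptions made up for illustration. It only shows what happens if an analyzer does strip elided pronouns.

```python
import re

# Hypothetical list of elided prefixes (c', d', j', l', m', n', s', t', qu').
# The rules of a real French stemmer may differ.
ELISION = re.compile(r"^(?:[cdjlmnst]|qu)'", re.IGNORECASE)


def strip_elision(token):
    """Remove a leading elided article or pronoun such as "m'" or "t'"."""
    return ELISION.sub("", token)


def analyze(text):
    """Split on whitespace, lowercase, then strip elided prefixes."""
    return [strip_elision(tok.lower()) for tok in text.split()]


# Both sentences reduce to the same token list, so a phrase query
# for one would also match the other.
print(analyze("il m'aime"))
print(analyze("il t'aime"))
```

Under these assumptions, "aujourd'hui" and "O'Reilly" pass through unchanged, because neither begins with one of the listed prefixes.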

So if this is correct, this is why I need to index both "m" and "aime"
as distinct tokens.

And I guess this is why "O'Reilly" is not split by the
StandardAnalyzer, since you don't want to find the documents containing
"N'Reilly".

Actually, the reason is that you wouldn't want to conflate searches for "Reilly" and "O'Reilly". Further processing of a token falls under the rubric of stemming.

More generally, I am a native French speaker, but I'm not
sure there are cases where a string with an apostrophe has to be
split into two (real) searchable tokens. I know the word "aujourd'hui"
(French for "today"), but it's likely a complete word by itself which
does not need to be split again.

So you wouldn't need a search for "aujourd" or "hui" to turn up documents which contain "aujourd'hui"? Very good.

But then, what about "t'aime"? If a search for "aime" should match documents which contain "t'aime", then that's our problematic example. You wouldn't care about searching for a pronoun -- EXCEPT when trying to match a phrase. If that's the case, then the StandardTokenizer may in fact be inadequate for French -- "t'aime" should be broken up into two tokens: "t" and "aime".
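
The trade-off can be sketched in a few lines of Python. This is not the StandardTokenizer; the two regular expressions and the helpers `tokenize` and `has_phrase` are hypothetical stand-ins, one keeping apostrophized strings together and one breaking them apart.

```python
import re

# Assumed patterns, for illustration only.
KEEP = re.compile(r"\w+(?:'\w+)*")  # "t'aime" stays a single token
SPLIT = re.compile(r"\w+")          # "t'aime" becomes "t" and "aime"


def tokenize(text, pattern):
    """Return the lowercased tokens matched by the given pattern."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


def has_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


docs = ["il m'aime", "il t'aime"]

for pattern in (KEEP, SPLIT):
    for doc in docs:
        tokens = tokenize(doc, pattern)
        term_hit = "aime" in tokens
        phrase_hit = has_phrase(tokens, tokenize("il m'aime", pattern))
        print(pattern.pattern, doc, tokens, term_hit, phrase_hit)
```

With the splitting pattern, a single-term search for "aime" finds both documents while the phrase "il m'aime" still matches only the first. With the keep-together pattern the phrase is equally precise, but "aime" on its own finds neither.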

If this is important to you, I could look further, and ask some French
linguists for help.

I'm asking because a new version of my own search engine library has a default tokenizer which keeps apostrophic strings together (like StandardTokenizer), and I want to be aware of cases where this choice causes problems. However, it's unlikely I'll change that behavior, as the problem is addressed by making it trivially easy to customize the tokenizer. So I would say that for my own purposes, consulting a linguist is probably overkill.


Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

