On Oct 11, 2005, at 10:04 AM, Hugo Lafayette wrote:

First of all, and maybe I'm making a false assumption here, but if you strip leading "j'", "t'" and so on, that means that if you run a search like:

 +text:"il m'aime"

you will get documents with the sentence "il m'aime" (French for "he
loves me") and documents with the sentence "il t'aime" (French for "he
loves you"), which is wrong, right?

I don't speak French, and I can't tell you whether Lingua::Stem::Snowball strips m' and t' -- the docs say "This method strips 's (english) and l', d', ... (french)."

That's a compelling example you have there, though, so I would hope not. Conceptually, I would want the search to focus on the relatively rare word for "love" rather than on the pronouns. However, if the stemmer strips the pronouns, "m'aime" and "t'aime" would be conflated, which is, as you say, "wrong". :) Is "aime" ever used in isolation, or is it always hitched to a pronoun?
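
To make the conflation concrete, here is a minimal Python sketch. It does not call Lingua::Stem::Snowball and makes no claim about what that module actually strips; the prefix list and the names `ELIDED_PREFIX`, `strip_elision`, and `analyze` are assumptions made up for illustration.

```python
import re

# Hypothetical list of elided French clitics; not taken from any real stemmer.
ELIDED_PREFIX = re.compile(r"^(?:j|m|t|l|d|s|n|c|qu)'", re.IGNORECASE)

def tokenize(text):
    """Split on whitespace, keeping apostrophic strings together."""
    return text.lower().split()

def strip_elision(token):
    """Drop a leading elided clitic such as m' or t', if present."""
    return ELIDED_PREFIX.sub("", token)

def analyze(text):
    """Tokenize, then strip the elided prefix from each token."""
    return [strip_elision(tok) for tok in tokenize(text)]

print(analyze("il m'aime"))  # ['il', 'aime']
print(analyze("il t'aime"))  # ['il', 'aime']
```

Both sentences reduce to the same token sequence, so a phrase query built from one would match documents containing the other.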

So if this is correct, this is why I need to index both "m" and "aime"
as distinct tokens.

And I guess this is why "O'Reilly" is not split by the
StandardAnalyzer, since you don't want to find the documents containing
"N'Reilly".

Actually, the reason is that you wouldn't want to conflate searches for "Reilly" and "O'Reilly". Further processing of a token falls under the rubric of stemming.

More generally, I am a native French speaker, but I'm not
sure there are any cases where a string with an apostrophe has to be
split into two (real) searchable tokens. I know the word "aujourd'hui"
(French for "today"), but it's likely a complete word by itself which
does not need to be split again.

So you wouldn't need a search for "aujourd" or "hui" to turn up documents which contain "aujourd'hui"? Very good.

But then, what about "t'aime"? If a search for "aime" should match documents which contain "t'aime", then that's our problematic example. You wouldn't care about searching for a pronoun -- EXCEPT when trying to match a phrase. If that's the case, then the StandardTokenizer may in fact be inadequate for French -- "t'aime" should be broken up into two tokens: "t" and "aime".
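
A rough Python sketch of that two-token behavior follows. It is not how StandardTokenizer works; the `KEEP_WHOLE` exception set and the `contains_phrase` helper are hypothetical names, and the phrase check is a naive in-order scan rather than a real positional index.

```python
import re

# Hypothetical exception list: apostrophic words that should stay whole.
KEEP_WHOLE = {"aujourd'hui"}

def tokenize(text):
    """Split apostrophic strings into separate tokens, except listed words."""
    tokens = []
    for word in re.findall(r"[\w']+", text.lower()):
        if word in KEEP_WHOLE:
            tokens.append(word)
        else:
            # "t'aime" becomes "t" and "aime"; empty pieces are dropped.
            tokens.extend(part for part in word.split("'") if part)
    return tokens

def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

doc = tokenize("il t'aime")
print(doc)                                          # ['il', 't', 'aime']
print(contains_phrase(doc, tokenize("aime")))       # True
print(contains_phrase(doc, tokenize("il m'aime")))  # False
```

With the pronoun kept as its own token, a search for "aime" still finds the document, while the phrase "il m'aime" no longer matches "il t'aime".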

If this is important to you, I could look further, and ask some French
linguists for help.

I'm asking because a new version of my own search engine library has a default tokenizer which keeps apostrophic strings together (like StandardTokenizer), and I want to be aware of cases where this choice causes problems. However, it's unlikely I'll change that behavior, as the problem is addressed by making it trivially easy to customize the tokenizer. So I would say that for my own purposes, consulting a linguist is probably overkill.
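
As an illustration of what a customizable tokenizer could look like, here is a small Python sketch. It is not the API of any particular library; the `Tokenizer` class and its `token_re` parameter are hypothetical.

```python
import re

class Tokenizer:
    # Default pattern keeps apostrophic strings such as "t'aime" together.
    DEFAULT_TOKEN_RE = r"\w+(?:'\w+)*"

    def __init__(self, token_re=DEFAULT_TOKEN_RE):
        # The caller may supply any regex that describes a single token.
        self.token_re = re.compile(token_re)

    def tokenize(self, text):
        """Return every lowercased match of the token pattern, in order."""
        return self.token_re.findall(text.lower())

default = Tokenizer()
split_on_apostrophe = Tokenizer(token_re=r"\w+")

print(default.tokenize("Il t'aime"))              # ['il', "t'aime"]
print(split_on_apostrophe.tokenize("Il t'aime"))  # ['il', 't', 'aime']
```

Under this design the default stays conservative, and a French-specific index would only need to pass a different pattern.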

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

