On Oct 11, 2005, at 10:04 AM, Hugo Lafayette wrote:
First of all, maybe I'm making a false assumption here, but if you
strip leading "j'", "t'" and so on, that means that if you run a search
like:
+text:"il m'aime"
you will get documents with the sentence "il m'aime" (French for "he
loves me") and documents with the sentence "il t'aime" (French for "he
loves you"), which is wrong, right?
I don't speak French, and I can't tell you whether
Lingua::Stem::Snowball strips m' and t' -- the docs say "This method
strips 's (english) and l', d', ... (french)."
That's a compelling example you have there, though, so I would hope
not. Conceptually, I would want the search to focus on the
relatively rare word for "love" rather than on the pronouns.
However, if the stemmer strips the pronouns, "m'aime" and "t'aime"
would be conflated, which is, as you say, "wrong". :) Is "aime" ever
used in isolation, or is it always hitched to a pronoun?
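
To make the conflation concrete, here is a minimal Python sketch. The
strip_elision helper and its prefix list are hypothetical, made up for
illustration; they are an assumption about what such stripping might look
like, not a description of what Lingua::Stem::Snowball actually does.

```python
import re

# Hypothetical list of elided French prefixes (j', m', t', l', s', d', n',
# c', qu'). This is an assumption for illustration only, not the real
# behavior of any stemmer.
ELISION = re.compile(r"^(?:[jmtlsdnc]|qu)'")


def strip_elision(token):
    """Drop a leading elided prefix such as m' or t' from one token."""
    return ELISION.sub("", token)


def analyze(text):
    """Lowercase, split on whitespace, then strip elisions from each token."""
    return [strip_elision(tok) for tok in text.lower().split()]


print(analyze("il m'aime"))  # ['il', 'aime']
print(analyze("il t'aime"))  # ['il', 'aime']
```

Both sentences reduce to the same token sequence, so a phrase query for one
would also match the other.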
So if this is correct, this is why I need to index both "m" and "aime"
as distinct tokens.
And I guess this is why "O'Reilly" is not split by the
StandardAnalyzer, since you don't want to find the documents containing
"N'Reilly".
Actually, the reason is that you wouldn't want to conflate searches
for "Reilly" and "O'Reilly". Further processing of a token falls
under the rubric of stemming.
More generally: I am a native French speaker, but I'm not sure there
are any cases where a string with an apostrophe has to be split into
two (real) searchable tokens. I know the word "aujourd'hui" (French for
"today"), but it's likely a complete word by itself which does not need
to be split any further.
So you wouldn't need a search for "aujourd" or "hui" to turn up
documents which contain "aujourd'hui"? Very good.
But then, what about "t'aime"? If a search for "aime" should match
documents which contain "t'aime", then that's our problematic
example. You wouldn't care about searching for a pronoun -- EXCEPT
when trying to match a phrase. If that's the case, then the
StandardTokenizer may in fact be inadequate for French -- "t'aime"
should be broken up into two tokens: "t" and "aime".
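
Here is a rough Python sketch of the two tokenizing choices and their
effect on term and phrase matching. The function names (tokenize_keep,
tokenize_split, contains_phrase) are hypothetical, and the regexes are my
own simplification; this is not how StandardTokenizer is implemented.

```python
import re

# [^\W\d_] matches any Unicode letter (a word character that is neither a
# digit nor an underscore).
LETTERS = r"[^\W\d_]+"


def tokenize_keep(text):
    """Keep apostrophic strings together: "t'aime" stays one token."""
    return re.findall(LETTERS + r"(?:'" + LETTERS + r")*", text.lower())


def tokenize_split(text):
    """Split at apostrophes: "t'aime" becomes "t" and "aime"."""
    return re.findall(LETTERS, text.lower())


def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run in doc_tokens."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


doc = "il t'aime"

print(tokenize_keep(doc))   # ['il', "t'aime"]
print(tokenize_split(doc))  # ['il', 't', 'aime']

# Kept together, a search for "aime" misses the document.
print(contains_phrase(tokenize_keep(doc), tokenize_keep("aime")))  # False

# Split apart, "aime" matches, and the phrase "il m'aime" still does not.
print(contains_phrase(tokenize_split(doc), tokenize_split("aime")))       # True
print(contains_phrase(tokenize_split(doc), tokenize_split("il m'aime")))  # False
```

The cost of splitting shows up on the other examples: tokenize_split would
also break "aujourd'hui" into "aujourd" and "hui", and "O'Reilly" into "o"
and "reilly".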
If this is important to you, I could look further, and ask some French
linguists for help.
I'm asking because a new version of my own search engine library has
a default tokenizer which keeps apostrophic strings together (like
StandardTokenizer), and I want to be aware of cases where this choice
causes problems. However, it's unlikely I'll change that behavior,
as the problem is addressed by making it trivially easy to customize
the tokenizer. So I would say that for my own purposes, consulting a
linguist is probably overkill.
Cheers,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/