On Fri, 20 Jul 2007 13:04:57 -0700, Marc Moskowitz <[EMAIL PROTECTED]> wrote:

> I have more questions about stemming. The query:
>
> let $x := <text xml:lang="fr">sont es ès</text>,
> $query1 := cts:word-query("être", ("lang=fr")),
> $query2 := cts:word-query("suis", ("lang=fr"))
> return (
> cts:highlight($x, $query1, element hit {$cts:text}),
> cts:highlight($x, $query2, element hit {$cts:text})
> )
>
> produces the results:
>
> <text xml:lang="fr"><hit>sont</hit> <hit>es</hit> ès</text>
> <text xml:lang="fr"><hit>sont</hit> <hit>es</hit> <hit>ès</hit></text>
>
> This seems to indicate that stemmed results get their
> diacritic-sensitive value for stemmed parts from the presence or absence
> of diacritics of the original search term. This seems incorrect, since
> the stemmer in theory has the correct diacritics for the stemmed parts.
> In this case in particular, ès is completely unrelated to être. Is this
> behavior we can affect on a database level or in some other way
> independent of specifying "diacritic-sensitive" for the base query?
> Marc Moskowitz
> Interactive Factory

In MarkLogic search, the default diacritic sensitivity (and case
sensitivity, for that matter) is inferred from the query term itself.
If the term contains no accented characters, the search is presumed
to be diacritic-insensitive. French stemming depends on diacritics,
so that default is a poor fit for stemmed searches in French. By
asking (implicitly, in this case) for a diacritic-insensitive search
with "suis", you are essentially asking us to treat "ès" and "es" as
the same word, and since "es" is a form of "être", you get the extra
hit. To avoid the hit for "ès", specify "diacritic-sensitive"
explicitly. With a language like French, I'd think you would pretty
much always want diacritic-sensitive search.

That is:
let $x := <text xml:lang="fr">sont es ès</text>,
    $query1 := cts:word-query("être", ("lang=fr", "diacritic-sensitive")),
    $query2 := cts:word-query("suis", ("lang=fr", "diacritic-sensitive"))
return (
  cts:highlight($x, $query1, element hit {$cts:text}),
  cts:highlight($x, $query2, element hit {$cts:text})
)
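
For comparison, here is a sketch of the original two queries with the
inferred defaults written out as explicit options. It assumes the
inference works as described above (a term with an accent is treated
as diacritic-sensitive, a term without one as diacritic-insensitive);
I have not run it, so treat the pairing of options as an illustration
rather than a statement of exactly what the server does internally.

```xquery
let $x := <text xml:lang="fr">sont es ès</text>,
    (: "être" contains an accent, so it is presumably treated as diacritic-sensitive :)
    $query1 := cts:word-query("être", ("lang=fr", "diacritic-sensitive")),
    (: "suis" has no accent, so it is presumably treated as diacritic-insensitive :)
    $query2 := cts:word-query("suis", ("lang=fr", "diacritic-insensitive"))
return (
  cts:highlight($x, $query1, element hit {$cts:text}),
  cts:highlight($x, $query2, element hit {$cts:text})
)
```

If the explanation holds, this should behave like the query in the
original message, with only the second query highlighting "ès".
Changing the second option to "diacritic-sensitive" is the whole fix.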

//Mary

[EMAIL PROTECTED]
Lead Applications Engineer
Mark Logic Corporation
