Hi

I'm still facing this problem.

I'm using lib-search, and editing its queries to include and-queries for 
every possible language isn't viable. For a start, I have no way of knowing 
all the possible languages my content may be in.
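
For content that is already loaded, one rough way to see which languages are 
declared is to list the distinct xml:lang values. This is only a sketch: it 
assumes languages are marked with xml:lang attributes, it says nothing about 
content loaded later, and on a large database it may be slow because it visits 
every such attribute.

```xquery
xquery version "1.0-ml";

(: Sketch: list the distinct xml:lang values present in the database.
   fn:collection() with no arguments is assumed to cover all documents. :)
fn:distinct-values(fn:collection()//@xml:lang)
```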

Disabling stemmed searches isn't an option because it is one of the key 
features we rely on. I have to be able to use stemmed searches for English 
content, and at the same time return matches from content in other languages.

So... here's my current plan, and I'd appreciate feedback on whether there's a 
better solution:

Remove all xml:lang attributes from all content. Replace with a custom meta 
tag, something like <meta:Lang>de</meta:Lang>, so that we don't lose the 
language info but MarkLogic doesn't auto-detect it.

I don’t like this solution but can't think of anything else. Personally I think 
this is a poor feature of MarkLogic. Turning stemming on/off should not affect 
the content base searched. Everything should be searched, with content in the 
configured language gaining the benefits of stemming.

Any comments/suggestions would be really welcome!

Thank you
Rob




-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: 10 February 2009 18:45
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] stemmed searches

The basic approach is to expand your search to search across the languages you 
are interested in.  For example, if a user enters a search term:

cat chat

and your content is in English and French, then you can expand into the 
following cts:query:

cts:or-query((
  cts:and-query((cts:word-query("cat", "lang=en"),  
                 cts:word-query("chat", "lang=en"))),
  cts:and-query((cts:word-query("cat", "lang=fr"),  
                 cts:word-query("chat", "lang=fr")))
))

It is up to you how you decide to parse the user input.
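
The same expansion can be generated for any list of terms and languages. The 
following is a minimal sketch, not a tested implementation: the function name 
local:multi-lang-query and its parameters are made up for illustration, and 
splitting the input on whitespace is just one possible way to parse it.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: builds one and-query per language, requiring every
   term to match in that language, then ORs the per-language queries. :)
declare function local:multi-lang-query(
  $terms as xs:string*,
  $langs as xs:string*
) as cts:query
{
  cts:or-query(
    for $lang in $langs
    return cts:and-query(
      for $term in $terms
      return cts:word-query($term, fn:concat("lang=", $lang))
    )
  )
};

(: Naive whitespace tokenization of the user input, for illustration only. :)
local:multi-lang-query(fn:tokenize("cat chat", "\s+"), ("en", "fr"))
```

Called with ("en", "fr"), this should produce a query of the same shape as the 
one written out by hand above. The result can be passed as the query argument 
of cts:search.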

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Whitby, Rob, CMG
Sent: Tuesday, February 10, 2009 9:08 AM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] stemmed searches

Can anyone help me with this issue? What is the best way to deal with content 
in multiple languages?

Thanks
Rob


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Whitby, Rob, CMG
Sent: 06 February 2009 11:41
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] stemmed searches

Thanks for the replies.

I'm using 4.0-1 on 64-bit Windows 2003 Server.

I think it is a language thing. Setting the lang option in the stemmed query 
does change the number of results. I'm surprised that stemming has the effect 
of limiting the search to one language. I expected it would still run the 
search on content in other languages, just without any benefit from stemming. 
Even better would be if the stemming were dynamic, based on the content language.

The consequences are worrying for general searching. I have content in multiple 
languages and would like the user to be able to enter search terms and receive 
results in any language. Is the only way to fix this to turn off stemming?

I guess I could set the xml:lang attribute to "en" for every article...

Thanks
Rob



-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Mary Holstege
Sent: 05 February 2009 20:13
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] stemmed searches

On Thu, 05 Feb 2009 09:58:19 -0800, Michael Blakeley 
<[email protected]> wrote:

> Rob,
>
> It's always a good idea to state which server release you are using, 
> and on which OS.
>
> The behavior you've observed doesn't look right to me, but I couldn't 
> easily reproduce it either. That suggests that something 
> content-specific or version-specific is at work: if you have a support 
> contract, I'd suggest that you contact support.

One possibility:

Stemmed searches search within a particular language, in this case the default, 
most likely English.  If for some reason the element in question is in some 
other language (e.g. an xml:lang="fr" on the Article element), then that "2009" 
would be in some other language, and therefore wouldn't show up on a stemmed 
English word-query.
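
One way to test this is to run the stemmed query once per language and compare 
the counts. This is a sketch under assumptions: "fr" is only a guess at the 
other language involved, and the path is the one from the original question.

```xquery
xquery version "1.0-ml";

(: Sketch: count stemmed matches for "2009" per language.
   The language codes are assumed; substitute the ones in your content. :)
for $lang in ("en", "fr")
return fn:concat($lang, ": ",
  fn:count(
    cts:search(
      /Journal/Volume/Issue/Article/PublishDate/Year,
      cts:word-query("2009", ("stemmed", fn:concat("lang=", $lang)), 1)
    )
  )
)
```

If the per-language counts add up to the unstemmed count, language tagging is 
the likely explanation for the missing matches.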

//Mary

>
> Meanwhile, you might try some other approaches. Would
> cts:element-value-query() be appropriate for this use-case? Or perhaps 
> a simple XPath?
>
>    /Journal/Volume/Issue/Article/PublishDate/Year[. eq 2009]
>
> If a word-query is what you want, it would be more efficient to write 
> this as an element-word-query:
>
>   cts:search(
>     /Journal/Volume/Issue/Article/PublishDate,
>     cts:element-word-query(xs:QName('Year'), "2009", ("unstemmed"), 1)
>   )
>
> thanks,
> -- Mike
>
> On 2009-02-05 07:14, Whitby, Rob, CMG wrote:
>> Can anyone explain why these 2 queries return different results?
>>
>> count(
>>    cts:search(
>>      /Journal/Volume/Issue/Article/PublishDate/Year,
>>      cts:word-query("2009", ("unstemmed"), 1)
>>    )
>> )
>>
>> = 3036 (the correct result)
>>
>> count(
>>    cts:search(
>>      /Journal/Volume/Issue/Article/PublishDate/Year,
>>      cts:word-query("2009", ("stemmed"), 1)
>>    )
>> )
>>
>> = 2757
>>
>> Why is the "stemmed" setting causing some matches to be missed?
>
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general


_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general