RE: [MarkLogic Dev General] stemmed searches

Whitby, Rob, CMG Mon, 30 Mar 2009 08:56:45 -0700

Thanks Mary and Geert

I'll look into how viable it is to convert the queries in my app to use the 
or-query hack. Hopefully the changes will be limited to lib-parser-custom. I'm 
not as optimistic about the performance, but I'll create some tests.


> if one day you want to start searching that 
> content as the language it really is, you lose
Not really, because I would store the language elsewhere, so could restrict the 
search by that instead. A smaller problem than the current one :)

On the wider issue - am I in the minority in thinking it wrong that 
stemmed/unstemmed queries search over different content? To my mind, the 
language parameter should affect the stemming rules NOT the set of documents 
searched. 

Or, unstemmed searches should adhere to the same language restriction. Not as 
useful but at least consistent..

Thanks
Rob



-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Mary Holstege
Sent: 30 March 2009 15:42
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] stemmed searches

On Mon, 30 Mar 2009 07:27:45 -0700, Geert Josten  
<[email protected]> wrote:

> Rob,
>
> You don't need to change the input of your lib-search call. You could  
> also change the lib-parser-custom.xqy, translating the word-query to the  
> or-query you wrote below. But that would affect all word-queries you  
> perform with lib-search unless you can distinguish between specific  
> ones..
>
> And perhaps someone from Mark Logic can reply to the performance part,  
> it might not be as bad as you think..

It is not as bad as you think. :)  Performance of searches such as this is
largely dependent on the number of hits, so if the or-query gets you all the
hits you expect, so would a stemmed search in the absence of xml:lang, and
their performance would be roughly comparable.  (If you start getting into a
large number of query terms this becomes less true, but not here.)

Removing xml:lang attributes so that things that aren't English pretend
to be English doesn't look like a great idea to me: if one day you want to
start searching that content as the language it really is, you lose.  On the
other hand, or-ing a stemmed search in English content with a non-stemmed
search in non-English content sounds about right.  It doesn't have to be
"exact": just "unstemmed" is good enough, and then you can handle
case- or diacritic-insensitive searches as well, or wildcards.  The down side
is that you will have to enable word-searches, so your index will be bigger
(if you don't have them enabled already).
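
A minimal sketch of that combined query, reusing the "extrahat" example
quoted further down. The $term variable and the choice of case- and
diacritic-insensitive options are assumptions made for illustration, and
the unstemmed branch assumes word searches are enabled on the database:

```xquery
(: Sketch only: $term stands in for whatever the query parser extracted. :)
let $term := "extrahat"
return
  cts:search(fn:doc(),
    cts:or-query((
      (: stemmed branch: matches content in the default (English) language :)
      cts:word-query($term, "stemmed"),
      (: unstemmed branch: meant to catch content in other languages,
         such as the Latin passage, without requiring "exact" :)
      cts:word-query($term,
        ("unstemmed", "case-insensitive", "diacritic-insensitive"))
    ))
  )
```

Whether the two branches really cost about the same as one stemmed search
depends on the data, so it is worth timing both forms on a real database.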

Cheers

//Mary


>>
>> Yes that explains the problem well.
>>
>> Best solution so far is to rewrite all queries like this:
>>
>> cts:search(doc(),
>>   cts:or-query((
>>     cts:word-query('search', 'exact'),
>>     cts:word-query('search', 'stemmed')
>>   ))
>> )
>>
>> We're using lib-search to generate complex queries on a lot
>> of fields and facets, so making this change everywhere in the
>> lib-search code won't be trivial. And then there's presumably
>> a large performance impact, as it effectively doubles the
>> number of queries.
>>
>> So I'm still planning on removing all the xml:lang attributes...
>>
>> Rob
>>
>>
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of
>> David Sewell
>> Sent: 30 March 2009 14:56
>> To: General Mark Logic Developer Discussion
>> Subject: RE: [MarkLogic Dev General] stemmed searches
>>
>> On Mon, 30 Mar 2009, Geert Josten wrote:
>>
>> >> I don't like this solution but can't think of anything else.
>> >> Personally I think this is a poor feature of MarkLogic.
>> >> Turning stemming on/off should not affect the content base
>> >> searched.
>> >> Everything should be searched, with content in the configured
>> >> language gaining the benefits of stemming.
>> >
>> > Are you sure that stemming is affecting which documents are being
>> > searched? It does of course affect how many results are found, but
>> > since stemming won't work on Old English, you will need to enter
>> > exactly matching tokens to find results in Old English texts.
>> > Stemming should only increase the hit ratio, not decrease it.
>>
>> We have the same issue. It's more a problem of coding
>> verbosity than anything else. We have stemmed searching set
>> on our main document database. So given data like this
>>
>> <p xml:lang="eng">In an earlier stage of the Common law it was death.
>> <foreign xml:lang="lat">si quis in aula regia pugnet, vel
>> arma sua extrahat et capiatur...</foreign></p>
>>
>> because our default language is English, the following search
>> returns null results:
>>
>>    cts:search(//p, "extrahat")
>>
>> as it is stemmed, and stemmed search works only on text in
>> elements with @xml:lang = English. So the search must be rewritten as
>>
>>    cts:search(//p,
>>       cts:word-query("extrahat", "exact")
>>    )
>>
>> But then you lose the stemmed search, which you might want if
>> the search term was "stage" for example. So either you have
>> to "and" all your searches, or choose between one kind of
>> search or the other.
>>
>>
>> David

Mary Holstege
Lead Engineer
Mark Logic Corporation
999 Skyway Drive
Suite 200
San Carlos, California  94070
+1 650 655 2336 Phone
[email protected]
www.marklogic.com
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
