Re: [MarkLogic Dev General] Sorting search results based on term frequency

Arne Kampf Mon, 17 Dec 2012 08:46:15 -0800

Mary, Mike,

Thanks for the fast response.


In our "real world" database there will be more documents than in this test 
example. Nevertheless, the problem remains that the element we base our search 
on is rather short (as are our documents overall), so the calculation will 
produce the same value for almost every hit. (There are additional criteria in 
our app, but they are not as important, so the weights are set so that this 
short element comes first.)

I checked the log after turning on the diagnostic trace events. As expected, 
the messages show the same figures for both documents (and for most "real" 
docs).

To get the desired sort order, do you have any other idea than using 
score-simple and dividing the result by the length of the element? 
I'm still worried about the performance implications, as I would have to gather 
all results first in order to sort them in a second step by this new figure. 
It sounds to me as if pagination would have to be implemented manually in 
this case...
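
A minimal sketch of the two-step approach described above, not a tested 
solution. It assumes the element and search term from the test case quoted 
further down; $start and $page-size are hypothetical paging parameters, and 
counting word tokens is just one possible definition of "length". It also 
shows the cost in question: every hit is scored and sorted before a page can 
be cut out.

```xquery
xquery version "1.0-ml";

(: Sketch only: re-rank hits by score-simple divided by heading length.
   $start and $page-size are hypothetical paging parameters. :)
declare variable $start as xs:integer := 1;
declare variable $page-size as xs:integer := 10;

let $query := cts:element-word-query(xs:QName("myowntest"), "mytestword")
let $ranked :=
  for $doc in cts:search(fn:doc(), $query, "score-simple")
  (: count only word tokens, ignoring spaces and punctuation :)
  let $words := fn:count(
    cts:tokenize(fn:string(($doc//myowntest)[1]))[. instance of cts:word])
  (: fn:max guards against division by zero for an empty element :)
  let $adjusted := cts:score($doc) div fn:max(($words, 1))
  order by $adjusted descending
  return $doc
(: pagination is done by hand, after every hit has been scored :)
return fn:subsequence($ranked, $start, $page-size)
```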

Thanks,
        Arne


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Mary Holstege
Sent: Monday, 17 December 2012 17:11
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Sorting search results based on term 
frequency

On Mon, 17 Dec 2012 07:05:05 -0800, Michael Sokolov <[email protected]>
wrote:

> My guess is you may be seeing the effect of limited precision in the 
> score calculation, but I'm sure someone from MarkLogic will give you a 
> more confident answer :)
>
> -Mike

Yes, that's about it. Scoring in MarkLogic is calculated through step functions 
scaled into integers, pre-computed in tables.
So everything in particular ranges of values will come out with the same 
result. The other thing to note is that score calculations, including document 
size normalization, involve logarithms.
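
To illustrate the effect of a logarithm followed by an integer step, here is 
a purely hypothetical sketch. The function local:step is made up for this 
example and is not MarkLogic's actual scoring table; it only shows how 
lengths of 1, 3 and 9 words all fall on the same step (0), while 250 words 
falls on a different one (2).

```xquery
xquery version "1.0-ml";

(: Hypothetical step function, NOT MarkLogic's real scoring table:
   one integer step per factor of ten in element length. :)
declare function local:step($words as xs:integer) as xs:integer {
  xs:integer(fn:floor(math:log10($words)))
};

for $words in (1, 3, 9, 250)
return fn:concat($words, " words: step ", local:step($words))
```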

Normalization is designed to differentiate a 1K document from a 100K document: 
you are working at the extreme low end of the scale, and all values are winding 
up in the same place. The other thing to keep in mind in testing this is the 
impact of the whole corpus if you are using logtfidf: if these are the only two 
documents in your database and they both have the target term, the IDF part of 
scoring may be dominating.

If you go to Group > Default > Diagnostics in the admin UI, you can enable some 
trace events that will show you the details of scoring calculations in the log:
Relevance IDF
Relevance Quality
Relevance TF

Turn on all of those. It will be very chatty, so I would only run this for 
small tests.

//Mary

[email protected]
Principal Engineer
MarkLogic Corporation

>
> On 12/17/2012 9:56 AM, Arne Kampf wrote:
> Hi all,
>
> first of all: sorry if this is an RTM question. I scanned the documentation 
> and didn't find any hint... I'm currently having trouble trying to get 
> search results ordered in the way I want them to be. I've looked at 
> the documentation in the search-dev-guide, explaining the different 
> calculation methods like logtfidf, logtf, ..., but didn’t find the 
> solution yet.
>
> I created a very small test containing two documents, and one search 
> command operating on one of the elements (see below). While the first 
> document contains "some more additional text" the second one contains 
> just the one word I'm searching for in the used element. I expected 
> the document with the "perfect match" to come first (the doc as well 
> as the element used for the search is "shorter", so I assumed the term 
> frequency would be different because it is defined as "normalized to 
> take into account the size of the document, so that a word that occurs
> 10 times in a 100 word document will get a higher score than a word 
> that occurs 100 times in a 1,000 word document"). But the result shows 
> that both docs are ranked in the same way (score, confidence, and 
> fitness), so that the sorting is not as desired. The chosen 
> calculation method (logtfidf or logtf) influences the absolute values, 
> but not the result that both docs are treated as "equally meaningful". 
> What I found out during testing is that if the text in the element is 
> “very large” (tested with 200+ words), then the desired effect finally 
> occurs: the doc gets a lower score. But in our case this does not really 
> help, as the text is a heading, normally containing only one or very few 
> words…
>
> Could you please explain what I'm doing wrong, and/or how I'm able to 
> achieve the desired result? Perhaps it would be an idea to use 
> score-simple and then divide the returned score by the length of the 
> element - but this would make pagination obsolete, as I would have to 
> get all return values first in order to calculate "my own score", 
> which could have negative impact on the runtime (and please correct me 
> if this is wrong as well).
>
> Thanks in advance,
>                Arne Kampf
> ---------
> import module
>         namespace search="http://marklogic.com/appservices/search"
>         at "/MarkLogic/appservices/search/search.xqy";
>
> (: this should come second :)
> let $data1 :=
>     <data>
>         <myowntest>very very long text which is containing MyTestWord 
> only once, so not as relevant as if it is the only word (tested up to
> 100 words)</myowntest>
>         <b>this has nothing to do with the element myowntest, so it 
> should be ignored</b>
>         <b>this has nothing to do with the element myowntest, so it 
> should be ignored</b>
>     </data>
>
> (: this should come first :)
> let $data2 :=
>     <data>
>         <myowntest>MyTestWord</myowntest>
>         <b>this has nothing to do with the element myowntest, so it 
> should be ignored</b>
>     </data>
>
> let $dummy :=
>     (xdmp:document-insert("/test/doc1", $data1),
>      xdmp:document-insert("/test/doc2", $data2))
>
> let $options-for-search :=
>     <options xmlns="http://marklogic.com/appservices/search">
>          <constraint name="testconstraint">
>             <word>
>                 <element ns="" name="myowntest"/>
>             </word>
>         </constraint>
>
>         <search-option>score-logtfidf</search-option>
>         <debug>true</debug>
>     </options>
>
> return search:search("testconstraint:mytestword", $options-for-search)
>
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
>
>


_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general