Re: [MarkLogic Dev General] Sorting search results based on term frequency

Michael Sokolov Mon, 17 Dec 2012 07:05:27 -0800

My guess is you may be seeing the effect of limited precision in the score calculation, but I'm sure someone from MarkLogic will give you a more confident answer :)


-Mike


On 12/17/2012 9:56 AM, Arne Kampf wrote:

Hi all,
first of all: sorry if this is an RTM question, I scanned the documents and didn't find any hint... I'm currently having trouble getting search results ordered the way I want them to be. I've looked at the documentation in the search-dev-guide explaining the different calculation methods like logtfidf, logtf, ..., but haven't found the solution yet.
I created a very small test containing two documents, and one search command operating on one of the elements (see below). While the first document contains "some more additional text", the second one contains just the one word I'm searching for in the used element. I expected the document with the "perfect match" to come first (the doc as well as the element used for the search is "shorter", so I assumed the term frequency would be different because it is defined as "normalized to take into account the size of the document, so that a word that occurs 10 times in a 100 word document will get a higher score than a word that occurs 100 times in a 1,000 word document"). But the result shows that both docs are ranked in the same way (score, confidence, and fitness), so that the sorting is not as desired. The chosen calculation method (logtfidf or logtf) influences the absolute values, but not the result that both docs are treated as "equally meaningful".

What I found out during testing is that if the text in the element is "very large" (tested with 200+ words), then the desired effect finally occurs: the doc has a lower score. But in our case this does not really help, as the text is a heading, normally containing only one or very few words...
Could you please explain what I'm doing wrong, and/or how I'm able to achieve the desired result? Perhaps it would be an idea to use score-simple and then divide the returned score by the length of the element - but this would make pagination obsolete, as I would have to get all return values first in order to calculate "my own score", which could have a negative impact on the runtime (and please correct me if this is wrong as well).
Thanks in advance,

 Arne Kampf

---------

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: this should come second :)
let $data1 :=
  <data>
    <myowntest>very very long text which is containing MyTestWord only once, so not as relevant as if it is the only word (tested up to 100 words)</myowntest>
    this has nothing to do with the element myowntest, so it should be ignored
    this has nothing to do with the element myowntest, so it should be ignored
  </data>

(: this should come first :)
let $data2 :=
  <data>
    <myowntest>MyTestWord</myowntest>
    this has nothing to do with the element myowntest, so it should be ignored
  </data>

let $dummy := (
  xdmp:document-insert("/test/doc1", $data1),
  xdmp:document-insert("/test/doc2", $data2)
)

let $options-for-search :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="testconstraint">
      <word>
        <element ns="" name="myowntest"/>
      </word>
    </constraint>
    <search-option>score-logtfidf</search-option>
    <debug>true</debug>
  </options>

return search:search("testconstraint:mytestword", $options-for-search)
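
For reference, the workaround floated above (use score-simple, then divide by the length of the element) could be sketched roughly as follows. This is only an illustration under assumptions, not a recommended or confirmed solution: it assumes the two test documents are already committed, measures "length" as the number of word tokens in myowntest, and uses made-up names (hit, adjusted-score) for the output. As noted in the question, it has to visit every match before sorting, so it does not combine well with pagination.

```xquery
xquery version "1.0-ml";

(: Sketch only: re-rank matches by (score-simple score) / (word count of
   the myowntest element). The <hit> element and its attribute names are
   hypothetical and exist only for this illustration. :)
let $query := cts:element-word-query(xs:QName("myowntest"), "mytestword")
for $doc in cts:search(fn:doc(), $query, "score-simple")
(: join in case a document holds more than one myowntest element :)
let $text := fn:string-join($doc//myowntest, " ")
(: count word tokens only, skipping whitespace and punctuation tokens :)
let $words := fn:count(cts:tokenize($text)[. instance of cts:word])
(: guard against division by zero for an empty element :)
let $adjusted := cts:score($doc) div fn:max(($words, 1))
order by $adjusted descending
return <hit uri="{xdmp:node-uri($doc)}" adjusted-score="{$adjusted}"/>
```

With the two test documents, /test/doc2 has one word in myowntest while /test/doc1 has many, so doc2 should sort first as long as both receive the same raw score.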

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
