[MarkLogic Dev General] Sorting search results based on term frequency

Arne Kampf Mon, 17 Dec 2012 06:54:45 -0800

Hi all,

first of all: sorry if this is an RTM question; I scanned the documentation and 
didn't find any hint. I'm currently having trouble getting search results 
ordered the way I want. I've read the part of the search-dev-guide that 
explains the different score calculation methods (logtfidf, logtf, ...), but 
haven't found the solution yet.


I created a very small test with two documents and one search that operates on 
a single element (see below). In the first document that element contains 
"some more additional text" around the search term; in the second it contains 
just the one word I'm searching for. I expected the document with the "perfect 
match" to come first: both the doc and the element used for the search are 
shorter, so I assumed the term frequency would differ, because it is defined as 
"normalized to take into account the size of the document, so that a word that 
occurs 10 times in a 100 word document will get a higher score than a word that 
occurs 100 times in a 1,000 word document".

But the result shows that both docs get identical values (score, confidence, 
and fitness), so the sorting is not as desired. The chosen calculation method 
(logtfidf or logtf) changes the absolute values, but both docs are still 
treated as "equally meaningful". During testing I found that if the text in the 
element is very large (tested with 200+ words), the desired effect finally 
occurs and that doc gets a lower score. In our case this does not really help, 
though, as the text is a heading that normally contains only one or very few 
words.

Could you please explain what I'm doing wrong, and/or how I can achieve the 
desired result? One idea would be to use score-simple and then divide the 
returned score by the length of the element. But that would defeat pagination, 
as I would have to fetch all results first in order to calculate "my own 
score", which could have a negative impact on the runtime (please correct me if 
this is wrong as well).

Thanks in advance,
               Arne Kampf
---------
import module
        namespace search="http://marklogic.com/appservices/search"
        at "/MarkLogic/appservices/search/search.xqy";

(: this should come second :)
let $data1 :=
    <data>
        <myowntest>very very long text which is containing MyTestWord only 
once, so not as relevant as if it is the only word (tested up to 100 
words)</myowntest>
        <b>this has nothing to do with the element myowntest, so it should be 
ignored</b>
        <b>this has nothing to do with the element myowntest, so it should be 
ignored</b>
    </data>

(: this should come first :)
let $data2 :=
    <data>
        <myowntest>MyTestWord</myowntest>
        <b>this has nothing to do with the element myowntest, so it should be 
ignored</b>
    </data>

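(: note: the inserts only become visible once this statement commits, so the
   search below sees the documents from the second run onwards :)
let $dummy := 
(xdmp:document-insert("/test/doc1",$data1),xdmp:document-insert("/test/doc2",$data2))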

let $options-for-search :=
    <options xmlns="http://marklogic.com/appservices/search">
         <constraint name="testconstraint">
            <word>
                <element ns="" name="myowntest"/>
            </word>
        </constraint>

        <search-option>score-logtfidf</search-option>
        <debug>true</debug>
    </options>

return search:search("testconstraint:mytestword", $options-for-search)
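
For reference, the fallback idea from the question (score-simple divided by the 
length of the element) might look roughly like the sketch below. It is untested 
and rests on assumptions: the two test documents are already committed, the 
"length" is taken as the number of word tokens in the first myowntest element, 
and the variable name $my-score and the <hit> result element are made up for 
illustration.

```xquery
xquery version "1.0-ml";

(: Sketch only: re-rank matches by (simple score / element word count). :)
let $query := cts:element-word-query(xs:QName("myowntest"), "mytestword")
for $doc in cts:search(fn:doc(), $query, "score-simple")
(: count only word tokens, ignoring whitespace and punctuation :)
let $text := fn:string(($doc//myowntest)[1])
let $word-count := fn:count(cts:tokenize($text)[. instance of cts:word])
(: hypothetical custom score; fn:max guards against division by zero :)
let $my-score := cts:score($doc) div fn:max(($word-count, 1))
order by $my-score descending
return <hit uri="{xdmp:node-uri($doc)}" my-score="{$my-score}"/>
```

This also shows the drawback mentioned above: every match has to be fetched and 
scored before the ordering is known, so the server can no longer page through 
the results on its own.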

