My guess is you may be seeing the effect of limited precision in the
score calculation, but I'm sure someone from MarkLogic will give you a
more confident answer :)
-Mike
On 12/17/2012 9:56 AM, Arne Kampf wrote:
Hi all,
first of all: sorry if this is an RTM question; I scanned the
documentation and didn't find any hint. I'm currently having trouble
getting search results ordered the way I want them. I've looked at
the search-dev-guide, which explains the different score calculation
methods (logtfidf, logtf, ...), but haven't found the solution yet.
I created a very small test with two documents and one search
command operating on one of their elements (see below). In that
element, the first document contains "some more additional text",
while the second contains just the one word I'm searching for. I
expected the document with the "perfect match" to come first: both
the doc and the searched element are shorter, so I assumed the term
frequency would differ, because it is defined as "normalized to
take into account the size of the document, so that a word that occurs
10 times in a 100 word document will get a higher score than a word
that occurs 100 times in a 1,000 word document". But the result shows
that both docs get the same score, confidence, and fitness, so the
ordering is not as desired. The chosen calculation method (logtfidf
or logtf) changes the absolute values, but not the outcome that both
docs are treated as "equally meaningful".
What I found during testing is that if the text in the element is
"very large" (tested with 200+ words), the desired effect finally
occurs and the doc gets a lower score. But in our case this does not
really help, as the text is a heading, normally containing only one or
very few words...
Could you please explain what I'm doing wrong, and/or how I can
achieve the desired result? One idea would be to use score-simple
and then divide the returned score by the length of the element, but
that would defeat pagination, as I would have to fetch all results
first in order to calculate "my own score", which could hurt the
runtime (please correct me if this is wrong as well).
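Roughly, the "own score" idea would look like the sketch below. It is
untested and only an illustration: it assumes the "score-simple"
option of cts:search and cts:score behave as I understand them, it
counts words by splitting on whitespace, and the page size of 10 is
arbitrary.

```xquery
xquery version "1.0-ml";

(: Sketch only: re-rank matches by score-simple divided by the number
   of words in the searched element. :)
let $query := cts:element-word-query(xs:QName("myowntest"), "mytestword")
let $ranked :=
  (: every match has to be fetched before sorting :)
  for $doc in cts:search(fn:collection(), $query, "score-simple")
  let $text := fn:normalize-space(fn:string(($doc//myowntest)[1]))
  let $words := fn:count(fn:tokenize($text, " "))
  (: guard against an empty element to avoid dividing by zero :)
  order by cts:score($doc) div fn:max(($words, 1)) descending
  return $doc
(: pagination can only be applied after the full re-sort :)
return $ranked[1 to 10]
```

The final predicate is where the pagination problem shows up: the page
can only be cut out after all matches have been scored and sorted.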
Thanks in advance,
Arne Kampf
---------
import module
  namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";
(: this should come second :)
let $data1 :=
  <data>
    <myowntest>very very long text which is containing MyTestWord only
    once, so not as relevant as if it is the only word (tested up to 100
    words)</myowntest>
    <b>this has nothing to do with the element myowntest, so it
    should be ignored</b>
    <b>this has nothing to do with the element myowntest, so it
    should be ignored</b>
  </data>
(: this should come first :)
let $data2 :=
  <data>
    <myowntest>MyTestWord</myowntest>
    <b>this has nothing to do with the element myowntest, so it
    should be ignored</b>
  </data>
(: the inserts only become visible to searches once this statement has
   committed, so the results below show up from the second run on :)
let $dummy := (
  xdmp:document-insert("/test/doc1", $data1),
  xdmp:document-insert("/test/doc2", $data2)
)
let $options-for-search :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="testconstraint">
      <word>
        <element ns="" name="myowntest"/>
      </word>
    </constraint>
    <search-option>score-logtfidf</search-option>
    <debug>true</debug>
  </options>
return search:search("testconstraint:mytestword", $options-for-search)
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general