Peter,

Have you read chapter 23 of the Developer's Guide? It is available at http://developer.marklogic.com/pubs - section 23.1 describes the calculation that we perform for cts:score().

I think the interesting point for your question is that scores are calculated from inverse document frequency (IDF) as well as term frequency (TF). If that doesn't suit your application, you can choose an alternative scoring technique: try passing "score-logtf" or "score-simple" as an option to cts:search(). http://developer.marklogic.com/pubs/3.1/apidocs/SearchBuiltins.html#search has more information.
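As a rough sketch of what that could look like, here is the unweighted or-query from your message with "score-simple" passed as the third argument to cts:search(). The namespace URIs are assumptions (the opp one is a pure placeholder), so substitute whatever bindings your application already uses, and treat this as something to experiment with rather than a recommended configuration.

```xquery
(: Sketch only: the namespace URIs below are assumptions. :)
declare namespace dc  = "http://purl.org/dc/elements/1.1/";
declare namespace opp = "http://example.com/opp";  (: placeholder URI :)

(: The unweighted or-query from the thread, scored with "score-simple"
   instead of the default TF/IDF-based scoring. :)
for $x in cts:search(
  /doc,
  cts:or-query((
    cts:element-query(xs:QName("dc:title"), cts:word-query("bach")),
    cts:element-query(xs:QName("opp:body"), cts:word-query("bach"))
  )),
  "score-simple"
)[1 to 13]
return <result>{ base-uri($x) } : { cts:score($x) }</result>
```

Swapping "score-simple" for "score-logtf" in the same position tries the other option.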

It may also be helpful to note that the weight is a double. So if weights are capped at 16.0, you can give the other terms a weight below 1.0 to dampen them.
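A minimal sketch of that idea, assuming the same placeholder namespace bindings as any other query in this thread: the dc:title term keeps the weight of 16 and the opp:body term is dampened with a fractional weight. The value 0.25 is only an illustration, not a tuned or recommended setting.

```xquery
(: Sketch only: the namespace URIs below are assumptions. :)
declare namespace dc  = "http://purl.org/dc/elements/1.1/";
declare namespace opp = "http://example.com/opp";  (: placeholder URI :)

(: Widen the gap between the two terms: title at 16, body below 1.0.
   The third argument of cts:word-query() is the weight. :)
for $x in cts:search(
  /doc,
  cts:or-query((
    cts:element-query(xs:QName("dc:title"), cts:word-query("bach", (), 16)),
    cts:element-query(xs:QName("opp:body"), cts:word-query("bach", (), 0.25))
  ))
)[1 to 13]
return <result>{ base-uri($x) } : { cts:score($x) }</result>
```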

thanks,
-- Mike

Peter Hickman wrote:
This is a follow-up to my previous query about search weightings. The problem is a simple search for some text in the opp:body field. If the text is also in the dc:title element, in addition to opp:body, then the score of those results should be boosted. Naively, I entered the following query:

cts:search(
 /doc,
 cts:or-query((
   cts:element-query(xs:QName("dc:title"),cts:word-query("bach",(),16)),
   cts:element-query(xs:QName("opp:body"),cts:word-query("bach"))
 ))
)


What is happening, however, does not make any sense. Let me step you through my investigation. First, I get a list of the first 13 entries that have "bach" in opp:body.

<results>{
for $x at $i in (cts:search(
 /doc,  cts:element-query(xs:QName("opp:body"),cts:word-query("bach"))
))[1 to 13]
return <result id="{$i}">
 { base-uri($x) } :
 { cts:score($x) } :
 { $x/opp:meta/dc:title/text() }
 </result>
}</results>

<opp:results>
<opp:result id="1">/grove/music/19768 : 465 : Neue Bach-Gesellschaft.</opp:result>
<opp:result id="2">/grove/music/01690 : 434 : Bach, Cecilia.</opp:result>
<opp:result id="3">/grove/music/01696 : 434 : Bach Choir.</opp:result>
<opp:result id="4">/grove/music/52274 : 434 : Bach, P.D.Q.</opp:result>
<opp:result id="5">/grove/music/O007770 : 434 : Bach, P. D. Q.</opp:result>
<opp:result id="6">/grove/music/52912 : 434 : Bach Guild.</opp:result>
<opp:result id="7">/grove/music/01710 : 434 : Bach Society.</opp:result>
<opp:result id="8">/opr/t76/e649 : 434 : Bach Gesellschaft</opp:result>
<opp:result id="9">/opr/t114/e526 : 434 : Bach Revival</opp:result>
<opp:result id="10">/opr/t76/e3128 : 403 : Estro armonico, L’</opp:result>
<opp:result id="11">/grove/music/30356 : 403 : Williams, Peter (Frederic)</opp:result>
<opp:result id="12">/grove/music/01689 : 403 : Bach, August Wilhelm</opp:result>
<opp:result id="13">/grove/music/01692 : 403 : Bach, Vincent [Schrottenbach, Vinzenz]</opp:result>
</opp:results>

Then, just to make sure, I searched for "bach" in dc:title only:

<results>{
for $x at $i in (cts:search(
 /doc,  cts:element-query(xs:QName("dc:title"),cts:word-query("bach"))
))[1 to 13]
return <result id="{$i}">
 { base-uri($x) } :
 { cts:score($x) } :
 { $x/opp:meta/dc:title/text() }
 </result>
}</results>

<opp:results>
<opp:result id="1">/grove/music/19768 : 465 : Neue Bach-Gesellschaft.</opp:result>
<opp:result id="2">/grove/music/01690 : 434 : Bach, Cecilia.</opp:result>
<opp:result id="3">/grove/music/01696 : 434 : Bach Choir.</opp:result>
<opp:result id="4">/grove/music/52274 : 434 : Bach, P.D.Q.</opp:result>
<opp:result id="5">/grove/music/O007770 : 434 : Bach, P. D. Q.</opp:result>
<opp:result id="6">/grove/music/52912 : 434 : Bach Guild.</opp:result>
<opp:result id="7">/grove/music/01710 : 434 : Bach Society.</opp:result>
<opp:result id="8">/opr/t76/e649 : 434 : Bach Gesellschaft</opp:result>
<opp:result id="9">/opr/t114/e526 : 434 : Bach Revival</opp:result>
<opp:result id="10">/grove/music/01689 : 403 : Bach, August Wilhelm</opp:result>
<opp:result id="11">/grove/music/01692 : 403 : Bach, Vincent [Schrottenbach, Vinzenz]</opp:result>
<opp:result id="12">/grove/music/01693 : 403 : Bach-Abel Concerts.</opp:result>
<opp:result id="13">/grove/music/O006539 : 403 : English Bach Festival.</opp:result>
</opp:results>

Now I combined the two searches with a cts:or-query and no weightings:

<results>{
for $x at $i in (cts:search(
 /doc,  cts:or-query((
 cts:element-query(xs:QName("opp:body"),cts:word-query("bach")),
 cts:element-query(xs:QName("dc:title"),cts:word-query("bach"))
  ))
))[1 to 13]
return <result id="{$i}">
 { base-uri($x) } :
 { cts:score($x) } :
 { $x/opp:meta/dc:title/text() }</result>
}</results>

<opp:results>
<opp:result id="1">/grove/music/19768 : 465 : Neue Bach-Gesellschaft.</opp:result>
<opp:result id="2">/grove/music/01690 : 434 : Bach, Cecilia.</opp:result>
<opp:result id="3">/grove/music/01696 : 434 : Bach Choir.</opp:result>
<opp:result id="4">/grove/music/52274 : 434 : Bach, P.D.Q.</opp:result>
<opp:result id="5">/grove/music/O007770 : 434 : Bach, P. D. Q.</opp:result>
<opp:result id="6">/grove/music/52912 : 434 : Bach Guild.</opp:result>
<opp:result id="7">/grove/music/01710 : 434 : Bach Society.</opp:result>
<opp:result id="8">/opr/t76/e649 : 434 : Bach Gesellschaft</opp:result>
<opp:result id="9">/opr/t114/e526 : 434 : Bach Revival</opp:result>
<opp:result id="10">/opr/t76/e3128 : 403 : Estro armonico, L’</opp:result>
<opp:result id="11">/grove/music/30356 : 403 : Williams, Peter (Frederic)</opp:result>
<opp:result id="12">/grove/music/01689 : 403 : Bach, August Wilhelm</opp:result>
<opp:result id="13">/grove/music/01692 : 403 : Bach, Vincent [Schrottenbach, Vinzenz]</opp:result>
</opp:results>

The results to note are 10 and 11: these are documents that do not contain "bach" in the dc:title element but have identical scores to documents that do (results 12 and 13). So now I add some weighting to the query for the dc:title element.

<results>{
for $x at $i in (cts:search(
 /doc,  cts:or-query((
 cts:element-query(xs:QName("dc:title"),cts:word-query("bach",(),16)),
 cts:element-query(xs:QName("opp:body"),cts:word-query("bach"))
  ))
))[1 to 13]
return <result id="{$i}">
 { base-uri($x) } :
 { cts:score($x) } :
 { $x/opp:meta/dc:title/text() }</result>
}</results>

<opp:results>
<opp:result id="1">/grove/music/19768 : 474 : Neue Bach-Gesellschaft.</opp:result>
<opp:result id="2">/grove/music/01690 : 443 : Bach, Cecilia.</opp:result>
<opp:result id="3">/grove/music/01696 : 443 : Bach Choir.</opp:result>
<opp:result id="4">/grove/music/52274 : 443 : Bach, P.D.Q.</opp:result>
<opp:result id="5">/grove/music/O007770 : 443 : Bach, P. D. Q.</opp:result>
<opp:result id="6">/grove/music/52912 : 443 : Bach Guild.</opp:result>
<opp:result id="7">/grove/music/01710 : 443 : Bach Society.</opp:result>
<opp:result id="8">/opr/t76/e649 : 443 : Bach Gesellschaft</opp:result>
<opp:result id="9">/opr/t114/e526 : 443 : Bach Revival</opp:result>
<opp:result id="10">/opr/t76/e3128 : 411 : Estro armonico, L’</opp:result>
<opp:result id="11">/grove/music/30356 : 411 : Williams, Peter (Frederic)</opp:result>
<opp:result id="12">/grove/music/01689 : 411 : Bach, August Wilhelm</opp:result>
<opp:result id="13">/grove/music/01692 : 411 : Bach, Vincent [Schrottenbach, Vinzenz]</opp:result>
</opp:results>

Result .:   1   2   3   4   5   6   7   8   9  10  11  12  13
Before .: 465 434 434 434 434 434 434 434 434 403 403 403 403
After ..: 474 443 443 443 443 443 443 443 443 411 411 411 411

As you can see, the scores for all the results have changed, including those for results 10 and 11, which have received the same minuscule boost as 12 and 13. Remember that 10 and 11 do not have "bach" in the dc:title element, so I would have expected them not to receive a boost. The net effect is that everything has changed and everything has stayed the same (probably sounds better in French).
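For reference, the per-result change can be computed directly from the Before and After rows of the table. This is just arithmetic on the numbers already quoted, written as a small XQuery expression; the variable names are arbitrary.

```xquery
(: Scores copied from the Before/After table in this message. :)
let $before := (465, 434, 434, 434, 434, 434, 434, 434, 434, 403, 403, 403, 403)
let $after  := (474, 443, 443, 443, 443, 443, 443, 443, 443, 411, 411, 411, 411)
(: Difference for each result position, in order. :)
for $b at $i in $before
return $after[$i] - $b
```

This yields 9 for results 1 to 9 and 8 for results 10 to 13, a change of roughly 2% in every case, whether or not the title contains "bach".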

Whatever I do, the ordering remains the same. I have tried some completely insane values (only to discover that the max appears to be 16), and the only outcome is that all the results change by the same amount and the ordering remains unaltered.

I am beginning to suspect that the whole query weighting song and dance is just plain broken.

Can someone please tell me what I am doing wrong or what else I might try?


