Re: [MarkLogic Dev General] xdmp:estimate overcounting

Will Thompson Wed, 20 Nov 2013 13:22:52 -0800

Thanks for this example, Mike. xdmp:plan is much easier to understand in ML7.


Now that result counts are correct, it’s more obvious that the Search API facet 
counts are often off by a few: they always overcount compared to the total 
returned when the search is executed with the corresponding constraint.

The problem seems to be that while cts:search can restrict its estimate to 
the fragments selected by the searchable expression, 
cts:element-values()/cts:frequency() cannot. Therefore any ancestor <chapter> 
fragment of our fragment root <doc> is counted in the facet estimate, while 
it is excluded from the search estimate.
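
To make the suspected mismatch concrete, here is a toy model in Python. It is 
only a sketch of the hypothesis above, not MarkLogic's actual behavior: the 
fragment list, the "red"/"blue" facet values, and both function names are 
invented for illustration. It assumes the facet count tallies every fragment 
holding a value, while the search estimate only counts fragments selected by 
the searchable expression (modeled here as kind == "doc").

```python
from collections import Counter

# Hypothetical fragments: (fragment id, kind, facet values indexed there).
# The "chapter" rows stand in for parent fragments that sit above <doc>.
FRAGMENTS = [
    ("ch1", "chapter", {"red"}),
    ("ch1/doc1", "doc", {"red"}),
    ("ch1/doc2", "doc", {"red", "blue"}),
    ("ch2", "chapter", set()),
    ("ch2/doc1", "doc", {"blue"}),
]


def facet_counts(fragments):
    """Tally every fragment holding a value, ignoring the searchable expression."""
    counts = Counter()
    for _frag_id, _kind, values in fragments:
        counts.update(values)
    return counts


def search_estimate(fragments, value, kind="doc"):
    """Count only fragments of the selected kind that hold the value."""
    return sum(1 for _frag_id, k, values in fragments
               if k == kind and value in values)


if __name__ == "__main__":
    facets = facet_counts(FRAGMENTS)
    for value in sorted(facets):
        # Columns: value, facet count, search estimate
        print(value, facets[value], search_estimate(FRAGMENTS, value))
```

In this model "red" gets a facet count of 3 but a search estimate of 2, 
because the parent fragment "ch1" also holds the value; "blue" agrees at 2 
because no parent fragment holds it.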

Is there a workaround, or is this just a pathological condition of using 
fragment roots?


-Will



On Nov 19, 2013, at 5:15 PM, Michael Blakeley <[email protected]> wrote:

> That makes sense. For SEO purposes here's an example of how xdmp:plan might 
> help debug that sort of thing. The extra output in ML7 makes it clear that 
> with fast-phrase and without word-positions, only two-word terms are checked.
> 
> It is also possible to figure this out from the ML6 plans, but I think the 
> new annotations make it easier to understand.
> 
> -- Mike
> 
> xdmp:plan(
>  cts:search(doc(), cts:word-query('dog cat rat')))
> 
> (: fast-phrase, no word-positions :)
> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>  <qry:info-trace>xdmp:eval("xdmp:plan(&amp;#13;&amp;#10;  cts:search(doc(), 
> cts:word-query('dog cat ...", (), &lt;options 
> xmlns="xdmp:eval"&gt;&lt;database&gt;14758162542116138691&lt;/database&gt;&lt;modules&gt;17366211626271...&lt;/options&gt;)</qry:info-trace>
>  <qry:info-trace>Analyzing path for search: fn:doc()</qry:info-trace>
>  <qry:info-trace>Step 1 is searchable: fn:doc()</qry:info-trace>
>  <qry:info-trace>Path is fully searchable.</qry:info-trace>
>  <qry:info-trace>Gathering constraints.</qry:info-trace>
>  <qry:word-trace text="dog cat">
>    <qry:key>2096356216808567173</qry:key>
>  </qry:word-trace>
>  <qry:word-trace text="cat rat">
>    <qry:key>12758927055138826609</qry:key>
>  </qry:word-trace>
>  <qry:info-trace>Search query contributed 2 constraints: cts:word-query("dog 
> cat rat", ("lang=en"), 1)</qry:info-trace>
>  <qry:partial-plan>
>    <qry:term-query weight="1">
>      <qry:key>2096356216808567173</qry:key>
>      <qry:annotation>pair(word("dog"),word("cat"))</qry:annotation>
>    </qry:term-query>
>  </qry:partial-plan>
>  <qry:partial-plan>
>    <qry:term-query weight="1">
>      <qry:key>12758927055138826609</qry:key>
>      <qry:annotation>pair(word("cat"),word("rat"))</qry:annotation>
>    </qry:term-query>
>  </qry:partial-plan>
>  <qry:info-trace>Executing search.</qry:info-trace>
>  <qry:final-plan>
>    <qry:and-query>
>      <qry:term-query weight="1">
>       <qry:key>2096356216808567173</qry:key>
>       <qry:annotation>pair(word("dog"),word("cat"))</qry:annotation>
>      </qry:term-query>
>      <qry:term-query weight="1">
>       <qry:key>12758927055138826609</qry:key>
>       <qry:annotation>pair(word("cat"),word("rat"))</qry:annotation>
>      </qry:term-query>
>    </qry:and-query>
>  </qry:final-plan>
>  <qry:info-trace>Selected 0 fragments to filter</qry:info-trace>
>  <qry:result estimate="0"/>
> </qry:query-plan>
> 
> (: word-positions :)
> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>  <qry:info-trace>xdmp:eval("xdmp:plan(&amp;#13;&amp;#10;  cts:search(doc(), 
> cts:word-query('dog cat ...", (), &lt;options 
> xmlns="xdmp:eval"&gt;&lt;database&gt;18400529833056734238&lt;/database&gt;&lt;root&gt;/Users/mblakele/S...&lt;/options&gt;)</qry:info-trace>
>  <qry:info-trace>Analyzing path for search: fn:doc()</qry:info-trace>
>  <qry:info-trace>Step 1 is searchable: fn:doc()</qry:info-trace>
>  <qry:info-trace>Path is fully searchable.</qry:info-trace>
>  <qry:info-trace>Gathering constraints.</qry:info-trace>
>  <qry:word-trace text="dog">
>    <qry:key>5166487143365525844</qry:key>
>  </qry:word-trace>
>  <qry:word-trace text="cat">
>    <qry:key>12545744176132597186</qry:key>
>  </qry:word-trace>
>  <qry:word-trace text="rat">
>    <qry:key>12285550591485045727</qry:key>
>  </qry:word-trace>
>  <qry:info-trace>Search query contributed 1 constraint: cts:word-query("dog 
> cat rat", ("lang=en"), 1)</qry:info-trace>
>  <qry:partial-plan>
>    <qry:word-query weight="1" min-occurs="1" max-occurs="4294967295">
>      <qry:KP pos="0">
>       <qry:key>5166487143365525844</qry:key>
>       <qry:annotation>word("dog")</qry:annotation>
>      </qry:KP>
>      <qry:KP pos="1">
>       <qry:key>12545744176132597186</qry:key>
>       <qry:annotation>word("cat")</qry:annotation>
>      </qry:KP>
>      <qry:KP pos="2">
>       <qry:key>12285550591485045727</qry:key>
>       <qry:annotation>word("rat")</qry:annotation>
>      </qry:KP>
>    </qry:word-query>
>  </qry:partial-plan>
>  <qry:info-trace>Executing search.</qry:info-trace>
>  <qry:final-plan>
>    <qry:and-query>
>      <qry:word-query weight="1" min-occurs="1" max-occurs="4294967295">
>       <qry:KP pos="0">
>         <qry:key>5166487143365525844</qry:key>
>         <qry:annotation>word("dog")</qry:annotation>
>       </qry:KP>
>       <qry:KP pos="1">
>         <qry:key>12545744176132597186</qry:key>
>         <qry:annotation>word("cat")</qry:annotation>
>       </qry:KP>
>       <qry:KP pos="2">
>         <qry:key>12285550591485045727</qry:key>
>         <qry:annotation>word("rat")</qry:annotation>
>       </qry:KP>
>      </qry:word-query>
>    </qry:and-query>
>  </qry:final-plan>
>  <qry:info-trace>Selected 0 fragments to filter</qry:info-trace>
>  <qry:result estimate="0"/>
> </qry:query-plan>
> 
> On 19 Nov 2013, at 15:05 , Will Thompson <[email protected]> wrote:
> 
>> I narrowed down the problem to 3+ word phrases. With that hunch, I enabled 
>> word positions, and after reindexing the estimates are now correct.
>> 
>> I was thinking, incorrectly, that estimates would still be accurate with 
>> only fast phrase searches (and not word positions) enabled. But now that I 
>> look back at how that works, it’s clear that would only be true of 2-word 
>> phrases.
>> 
>> -Will
>> 
>> 
>> On Nov 19, 2013, at 3:23 PM, Michael Blakeley <[email protected]> wrote:
>> 
>>> Which release is this? Is the problem limited to a particular word? If so, 
>>> what words?
>>> 
>>> Have you tried a query trace or xdmp:plan yet? If you can run that with ML7 
>>> that is even more useful.
>>> 
>>> -- Mike
>>> 
>>> On 19 Nov 2013, at 12:43 , Will Thompson <[email protected]> wrote:
>>> 
>>>> I’m trying to determine why some search result estimates are overcounted. 
>>>> Documents generally look like:
>>>> 
>>>> <chapter>
>>>> <subchapter>
>>>>     <doc>
>>>>         <section>
>>>> 
>>>> Fragment root is set on <doc> (and no ancestors or descendants of <doc>). 
>>>> count(//doc) = xdmp:estimate(//doc) => true. The searchable expression is 
>>>> xdmp:directory(('dir1', 'dir2', …), 'infinity')//doc. The word query 
>>>> specification explicitly includes <doc> and excludes the document root.
>>>> 
>>>> The documentation suggests that, to prevent overcounting, we just need to 
>>>> ensure that 1) searchable expressions always select a fragment, and 2) no 
>>>> predicates are applied to the searchable expression. Are there any other 
>>>> conditions that may cause overcounting of a simple word query?
>>>> 
>>>> -Will
>>>> _______________________________________________
>>>> General mailing list
>>>> [email protected]
>>>> http://developer.marklogic.com/mailman/listinfo/general