Re: [MarkLogic Dev General] xdmp:estimate overcounting

Will Thompson Wed, 20 Nov 2013 14:27:40 -0800

Mike,

Scratch that, I think I got it working. Thanks.


-Will

On Nov 20, 2013, at 3:43 PM, Will Thompson <[email protected]> wrote:

> Geert,
> 
> I set <facet-option>fragment-frequency</facet-option>, just in case, but as 
> far as I can tell it is the default (6.0-4).
> 
> Mike,
> 
> I tried both and-ing the element-query and putting the whole query as a child 
> of element-query, but the results are the same. It seems like what’s 
> happening is that the element constraint just enforces that the result match 
> within a <doc>, which I am guessing is still true when matching a descendant 
> <doc> of <chapter>.
> 
> -Will
> 
> 
> On Nov 20, 2013, at 3:30 PM, Michael Blakeley <[email protected]> wrote:
> 
>> Ideally you'd pass the same searchable expression to the lexicon function 
>> and it would figure out how to resolve it. And that might be the key to a 
>> workaround.
>> 
>> As I understand it the unfiltered part of cts:search combines terms from the 
>> searchable expression with terms from the supplied query. So you could try 
>> to do that yourself: for example //doc is roughly equivalent to 
>> cts:element-query(xs:QName('doc'), cts:and-query(())). Call 
>> cts:element-values with cts:and-query of that new query and your user query.
>> 
>> I'm not sure if that will be 100% effective in every situation, but it's 
>> worth a try.
>> 
>> -- Mike
>> 
>> On 20 Nov 2013, at 13:22 , Will Thompson <[email protected]> wrote:
>> 
>>> Thanks for this example, Mike. xdmp:plan is much easier to understand in 
>>> ML7.
>>> 
>>> Now that result counts are correct, it’s more obvious that the Search API 
>>> facet counts are often off by a few, always overcounting compared to the 
>>> total returned after the search is executed with the related constraint. 
>>> 
>>> The problem seems to be that while cts:search is able to estimate result 
>>> counts within only the fragments defined in the searchable expression, 
>>> cts:element-values()/cts:frequency() is not. Therefore any ancestor 
>>> document <chapter> of our fragment root <doc> will be added to the 
>>> facet estimate, even though it is excluded from the search estimate.
>>> 
>>> Is there a workaround, or is this just a pathological condition of using 
>>> fragment roots?
>>> 
>>> 
>>> -Will
>>> 
>>> 
>>> 
>>> On Nov 19, 2013, at 5:15 PM, Michael Blakeley <[email protected]> wrote:
>>> 
>>>> That makes sense. For SEO purposes here's an example of how xdmp:plan 
>>>> might help debug that sort of thing. The extra output in ML7 makes it 
>>>> clear that with fast-phrase and without word-positions, only two-word 
>>>> terms are checked.
>>>> 
>>>> It is also possible to figure this out from the ML6 plans, but I think the 
>>>> new annotations make it easier to understand.
>>>> 
>>>> -- Mike
>>>> 
>>>> xdmp:plan(
>>>> cts:search(doc(), cts:word-query('dog cat rat')))
>>>> 
>>>> (: fast-phrase, no word-positions :)
>>>> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>>>> <qry:info-trace>xdmp:eval("xdmp:plan(&amp;#13;&amp;#10;  cts:search(doc(), 
>>>> cts:word-query('dog cat ...", (), &lt;options 
>>>> xmlns="xdmp:eval"&gt;&lt;database&gt;14758162542116138691&lt;/database&gt;&lt;modules&gt;17366211626271...&lt;/options&gt;)</qry:info-trace>
>>>> <qry:info-trace>Analyzing path for search: fn:doc()</qry:info-trace>
>>>> <qry:info-trace>Step 1 is searchable: fn:doc()</qry:info-trace>
>>>> <qry:info-trace>Path is fully searchable.</qry:info-trace>
>>>> <qry:info-trace>Gathering constraints.</qry:info-trace>
>>>> <qry:word-trace text="dog cat">
>>>> <qry:key>2096356216808567173</qry:key>
>>>> </qry:word-trace>
>>>> <qry:word-trace text="cat rat">
>>>> <qry:key>12758927055138826609</qry:key>
>>>> </qry:word-trace>
>>>> <qry:info-trace>Search query contributed 2 constraints: 
>>>> cts:word-query("dog cat rat", ("lang=en"), 1)</qry:info-trace>
>>>> <qry:partial-plan>
>>>> <qry:term-query weight="1">
>>>>   <qry:key>2096356216808567173</qry:key>
>>>>   <qry:annotation>pair(word("dog"),word("cat"))</qry:annotation>
>>>> </qry:term-query>
>>>> </qry:partial-plan>
>>>> <qry:partial-plan>
>>>> <qry:term-query weight="1">
>>>>   <qry:key>12758927055138826609</qry:key>
>>>>   <qry:annotation>pair(word("cat"),word("rat"))</qry:annotation>
>>>> </qry:term-query>
>>>> </qry:partial-plan>
>>>> <qry:info-trace>Executing search.</qry:info-trace>
>>>> <qry:final-plan>
>>>> <qry:and-query>
>>>>   <qry:term-query weight="1">
>>>>    <qry:key>2096356216808567173</qry:key>
>>>>    <qry:annotation>pair(word("dog"),word("cat"))</qry:annotation>
>>>>   </qry:term-query>
>>>>   <qry:term-query weight="1">
>>>>    <qry:key>12758927055138826609</qry:key>
>>>>    <qry:annotation>pair(word("cat"),word("rat"))</qry:annotation>
>>>>   </qry:term-query>
>>>> </qry:and-query>
>>>> </qry:final-plan>
>>>> <qry:info-trace>Selected 0 fragments to filter</qry:info-trace>
>>>> <qry:result estimate="0"/>
>>>> </qry:query-plan>
>>>> 
>>>> (: word-positions :)
>>>> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
>>>> <qry:info-trace>xdmp:eval("xdmp:plan(&amp;#13;&amp;#10;  cts:search(doc(), 
>>>> cts:word-query('dog cat ...", (), &lt;options 
>>>> xmlns="xdmp:eval"&gt;&lt;database&gt;18400529833056734238&lt;/database&gt;&lt;root&gt;/Users/mblakele/S...&lt;/options&gt;)</qry:info-trace>
>>>> <qry:info-trace>Analyzing path for search: fn:doc()</qry:info-trace>
>>>> <qry:info-trace>Step 1 is searchable: fn:doc()</qry:info-trace>
>>>> <qry:info-trace>Path is fully searchable.</qry:info-trace>
>>>> <qry:info-trace>Gathering constraints.</qry:info-trace>
>>>> <qry:word-trace text="dog">
>>>> <qry:key>5166487143365525844</qry:key>
>>>> </qry:word-trace>
>>>> <qry:word-trace text="cat">
>>>> <qry:key>12545744176132597186</qry:key>
>>>> </qry:word-trace>
>>>> <qry:word-trace text="rat">
>>>> <qry:key>12285550591485045727</qry:key>
>>>> </qry:word-trace>
>>>> <qry:info-trace>Search query contributed 1 constraint: cts:word-query("dog 
>>>> cat rat", ("lang=en"), 1)</qry:info-trace>
>>>> <qry:partial-plan>
>>>> <qry:word-query weight="1" min-occurs="1" max-occurs="4294967295">
>>>>   <qry:KP pos="0">
>>>>    <qry:key>5166487143365525844</qry:key>
>>>>    <qry:annotation>word("dog")</qry:annotation>
>>>>   </qry:KP>
>>>>   <qry:KP pos="1">
>>>>    <qry:key>12545744176132597186</qry:key>
>>>>    <qry:annotation>word("cat")</qry:annotation>
>>>>   </qry:KP>
>>>>   <qry:KP pos="2">
>>>>    <qry:key>12285550591485045727</qry:key>
>>>>    <qry:annotation>word("rat")</qry:annotation>
>>>>   </qry:KP>
>>>> </qry:word-query>
>>>> </qry:partial-plan>
>>>> <qry:info-trace>Executing search.</qry:info-trace>
>>>> <qry:final-plan>
>>>> <qry:and-query>
>>>>   <qry:word-query weight="1" min-occurs="1" max-occurs="4294967295">
>>>>    <qry:KP pos="0">
>>>>      <qry:key>5166487143365525844</qry:key>
>>>>      <qry:annotation>word("dog")</qry:annotation>
>>>>    </qry:KP>
>>>>    <qry:KP pos="1">
>>>>      <qry:key>12545744176132597186</qry:key>
>>>>      <qry:annotation>word("cat")</qry:annotation>
>>>>    </qry:KP>
>>>>    <qry:KP pos="2">
>>>>      <qry:key>12285550591485045727</qry:key>
>>>>      <qry:annotation>word("rat")</qry:annotation>
>>>>    </qry:KP>
>>>>   </qry:word-query>
>>>> </qry:and-query>
>>>> </qry:final-plan>
>>>> <qry:info-trace>Selected 0 fragments to filter</qry:info-trace>
>>>> <qry:result estimate="0"/>
>>>> </qry:query-plan>
>>>> 
>>>> On 19 Nov 2013, at 15:05 , Will Thompson <[email protected]> 
>>>> wrote:
>>>> 
>>>>> I narrowed down the problem to 3+ word phrases. With that hunch, I 
>>>>> enabled word positions, and after reindexing the estimates are now 
>>>>> correct.
>>>>> 
>>>>> I was thinking, incorrectly, that estimates would still be accurate with 
>>>>> only fast phrase searches (and not word positions) enabled. But now that 
>>>>> I look back at how that works, it’s clear that would only be true of 
>>>>> 2-word phrases.
>>>>> 
>>>>> -Will
>>>>> 
>>>>> 
>>>>> On Nov 19, 2013, at 3:23 PM, Michael Blakeley <[email protected]> wrote:
>>>>> 
>>>>>> Which release is this? Is the problem limited to a particular word? If 
>>>>>> so, what words?
>>>>>> 
>>>>>> Have you tried a query trace or xdmp:plan yet? If you can run that with 
>>>>>> ML7 that is even more useful.
>>>>>> 
>>>>>> -- Mike
>>>>>> 
>>>>>> On 19 Nov 2013, at 12:43 , Will Thompson <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>>> I’m trying to determine why some search result estimates are 
>>>>>>> overcounted. Documents generally look like:
>>>>>>> 
>>>>>>> <chapter>
>>>>>>> <subchapter>
>>>>>>>  <doc>
>>>>>>>      <section>
>>>>>>> 
>>>>>>> Fragment root is set on <doc> (and no ancestors or descendants of 
>>>>>>> <doc>). count(//doc) = xdmp:estimate(//doc) => true. The searchable 
>>>>>>> expression is xdmp:directory(('dir1', 'dir2', …), 'infinity')//doc. The 
>>>>>>> word query specification explicitly includes <doc> and excludes 
>>>>>>> document root. 
>>>>>>> 
>>>>>>> The documentation suggests that, to prevent overcounting, we just need to 
>>>>>>> ensure that 1) searchable expressions always select a fragment, and 2) no 
>>>>>>> predicates are applied to the searchable expression. Are there any other 
>>>>>>> conditions that may cause overcounting of a simple word query?
>>>>>>> 
>>>>>>> -Will
>>>>>>> _______________________________________________
>>>>>>> General mailing list
>>>>>>> [email protected]
>>>>>>> http://developer.marklogic.com/mailman/listinfo/general
