Re: [MarkLogic Dev General] Is "INFO" magic in search:search ? bad performance

Lee, David Mon, 11 Jul 2011 12:35:18 -0700

I narrowed this down.
I changed the data to be single documents instead of a top-level document with
a fragment root.
The results were not significantly different.


Adding 'unfiltered' made the results instant.
Similarly adding 'case-insensitive' made the results instant.

So what I think is going on here is that "INFO" matches case-insensitively to a 
huge number of fragments but case-sensitively to none.   Thus the filtered 
search has to go through the entire list by loading the fragments and doing the 
equivalent of a cts:contains() to find no results.

If "INFO" had more case-sensitive matches, the search could have terminated
sooner, as I see with other searches for words that have both case-sensitive
and case-insensitive matches.
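
One way to check this theory is to compare the index-only estimate with the filtered count for the same query. The sketch below assumes the /logs/ directory and the /logfile/log structure used elsewhere in this thread, and reuses the cts:word-query options shown in the debug output quoted further down. As far as I know, a word query whose text contains uppercase characters and has no explicit case option is treated as case-sensitive, which is what the filter would then enforce.

```xquery
xquery version "1.0-ml";

(: Sketch only: compares the index-only estimate with the filtered count.
   The query mirrors the one in the SEARCH-FLWOR report quoted in this thread. :)
let $query :=
  cts:and-query((
    cts:word-query("INFO", ("unwildcarded", "lang=en")),
    cts:directory-query("/logs/", "infinity")
  ))
return (
  (: Resolved from the indexes alone; no fragments are loaded. :)
  xdmp:estimate(cts:search(/logfile/log, $query)),
  (: Filtered (the default): every candidate fragment is loaded and checked. :)
  fn:count(cts:search(/logfile/log, $query, "filtered"))
)
```

If the theory holds, the first number should be large (the plan quoted below reports 308943) and the second should be 0. Expect the second expression to take as long as the slow search itself.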

Suggestions for improving this in a usable way?

A) Add a case-sensitive index?
B) Use case-insensitive matching (might be the better GUI behavior in this case).
C) Use unfiltered searches, but then I can get a HUGE number of false positives 
... running cts:contains on the top 10 hits will likely find none, so this is 
not usable.
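
For reference, option B could be expressed in the search options roughly as follows. This is a sketch, not a tested configuration: it reuses the options from the original query and only changes the term option, on the assumption that case-insensitive is accepted there just as unwildcarded is.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Sketch of option B: ask for case-insensitive term matching.
   Option C would instead add <search-option>unfiltered</search-option>
   to the options, at the cost of false positives. :)
search:search("INFO",
  <options xmlns="http://marklogic.com/appservices/search">
    <additional-query>{cts:directory-query("/logs/", "infinity")}</additional-query>
    <searchable-expression>/logfile/log</searchable-expression>
    <term>
      <term-option>case-insensitive</term-option>
    </term>
  </options>,
  1, 10)
```

Option A is, as far as I understand it, the "fast case sensitive searches" database setting; turning it on should let the index resolve the case-sensitive term directly, but it requires a reindex.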

Other ideas ?


----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
[email protected]
812-482-5224


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Lee, David
Sent: Monday, July 11, 2011 3:08 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Is "INFO" magic in search:search ? bad 
performance

I added
                 <search-option>unfiltered</search-option>

and now the result is almost instant (.14 sec).

I need to ponder what that's affecting ...
In the meantime I'm also experimenting with no sub-document fragmentation (but 
it will be a while to reload the DB with 4 GB of new data!)





-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Lee, David
Sent: Monday, July 11, 2011 2:58 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Is "INFO" magic in search:search ? bad 
performance

Changing to //log didn't affect things much (slightly faster), but I'm still 
concerned about the error in the search:report.

It reports that 308943 fragments were found but then returns none of them.

I suspect the hint about fragmentation is correct.  I'm trying some new code 
that splits up the files into separate documents.
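
A minimal sketch of what such a split might look like, for a single logfile document. The source URI and the target URI pattern are hypothetical; only the /logfile/log structure and the /logs/ directory come from this thread.

```xquery
xquery version "1.0-ml";

(: Sketch only: the source URI and the target URI pattern are hypothetical. :)
let $source-uri := "/logs/example-logfile.xml"
for $log at $i in fn:doc($source-uri)/logfile/log
return
  (: Each log entry becomes its own document with <log> as the root. :)
  xdmp:document-insert(
    fn:concat("/logs/example-logfile/", $i, ".xml"),
    $log
  )
```

With this layout the searchable expression would become /log, and the directory query on /logs/ with depth "infinity" would still cover the new documents. For a logfile with a very large number of entries, doing this in a single transaction is probably not practical, so the real load would likely need to be batched.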






-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Monday, July 11, 2011 2:27 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Is "INFO" magic in search:search ? bad 
performance

If it isn't already on your list, you might try changing the searchable 
expression to //log or turning off filtering entirely. I'm still thinking about 
that parent fragment, and filtering can also cross fragment boundaries.

-- Mike

On 11 Jul 2011, at 11:12 , Lee, David wrote:

> Here are the results with snippet processing turned off and the return-plan 
> and debug options on.
> It's clearly in the searching, not the snippet processing.
> 
> 
> I'm going to try a few other things as well but thought this might be useful 
> quick feedback.
> search:search( "INFO" ,
>   <options xmlns="http://marklogic.com/appservices/search">
>     <additional-query>{cts:directory-query("/logs/","infinity")}
>     </additional-query>
>     <searchable-expression>/logfile/log</searchable-expression>
>     <term>
>       <term-option>unwildcarded</term-option>
>     </term>
>     <sort-order type="dateTime" direction="ascending">
>       <element ns="" name="log"/>
>       <attribute ns="" name="time"/>
>     </sort-order>
>     <empty apply="no-results" />
>     <return-plan>true</return-plan>
>     <debug>true</debug>
>     <transform-results apply="raw">
>     </transform-results>
>   </options>
>   , 1 , 10 )
> 
> ------ Result 
> 
> 
> 
> <search:response total="0" start="1" page-length="10" xmlns="" 
> xmlns:search="http://marklogic.com/appservices/search">
> <search:plan>
> <qry:query-plan xmlns:qry="http://marklogic.com/cts/query">
> <qry:info-trace>xdmp:value(xs:untypedAtomic("xdmp:plan(cts:search(/logfile/log,
>  cts:and-query((cts:word-query..."))</qry:info-trace>
> <qry:info-trace>Analyzing path for search: 
> fn:collection()/logfile/log</qry:info-trace>
> <qry:info-trace>Step 1 is searchable: fn:collection()</qry:info-trace>
> <qry:info-trace>Step 2 is searchable: logfile</qry:info-trace>
> <qry:info-trace>Step 3 is searchable: log</qry:info-trace>
> <qry:info-trace>Path is fully searchable.</qry:info-trace>
> <qry:info-trace>Gathering constraints.</qry:info-trace>
> <qry:word-trace text="INFO">
> <qry:key>9355958571458624353</qry:key>
> </qry:word-trace>
> <qry:info-trace>Search query contributed 2 constraints: 
> cts:and-query((cts:word-query("INFO", ("unwildcarded","lang=en"), 1), 
> cts:directory-query("/logs/", "infinity")), ())</qry:info-trace>
> <qry:partial-plan>
> <qry:term-query weight="1">
> <qry:key>9355958571458624353</qry:key>
> </qry:term-query>
> </qry:partial-plan>
> <qry:partial-plan>
> <qry:term-query weight="0">
> <qry:key>7047720329996541787</qry:key>
> </qry:term-query>
> </qry:partial-plan>
> <qry:info-trace>Executing search.</qry:info-trace>
> <qry:final-plan>
> <qry:and-query>
> <qry:term-query weight="0">
> <qry:key>16738249755881875161</qry:key>
> </qry:term-query>
> <qry:or-two-queries>
> <qry:term-query weight="0">
> <qry:key>4112558586381539294</qry:key>
> </qry:term-query>
> <qry:term-query weight="0">
> <qry:key>16457039686800503886</qry:key>
> </qry:term-query>
> </qry:or-two-queries>
> <qry:term-query weight="1">
> <qry:key>9355958571458624353</qry:key>
> </qry:term-query>
> <qry:term-query weight="0">
> <qry:key>7047720329996541787</qry:key>
> </qry:term-query>
> </qry:and-query>
> </qry:final-plan>
> <qry:info-trace>Selected 308943 fragments to filter</qry:info-trace>
> <qry:result estimate="308943"/>
> </qry:query-plan>
> </search:plan>
> <search:qtext>INFO</search:qtext>
> <search:report id="SEARCH-SCHEMAINVALID">
> <error:format-string 
> xmlns:error="http://marklogic.com/xdmp/error">XDMP-VALIDATEUNEXPECTED: 
> (err:XQDY0027) validate strict { $opt } -- Invalid node: Found search:empty 
> but expected 
> (search:additional-query|search:annotation|search:concurrency-level|search:constraint|search:debug|search:default-suggestion-source|search:forest|search:grammar|search:operator|search:page-length|search:quality-weight|search:return-constraints|search:return-facets|search:return-metrics|search:return-plan|search:return-qtext|search:return-query|search:return-results|search:return-similar|search:search-option|search:searchable-expression|search:sort-order|search:suggestion-source|search:term|search:transform-results)*
>  at /search:options/search:empty using schema 
> "search.xsd"</error:format-string>
> <error:data xmlns:error="http://marklogic.com/xdmp/error">
> <error:datum>search:empty</error:datum>
> <error:datum>(search:additional-query|search:annotation|search:concurrency-level|search:constraint|search:debug|search:default-suggestion-source|search:forest|search:grammar|search:operator|search:page-length|search:quality-weight|search:return-constraints|search:return-facets|search:return-metrics|search:return-plan|search:return-qtext|search:return-query|search:return-results|search:return-similar|search:search-option|search:searchable-expression|search:sort-order|search:suggestion-source|search:term|search:transform-results)*</error:datum>
> <error:datum>/search:options/search:empty</error:datum>
> <error:datum>"search.xsd"</error:datum>
> </error:data>
> </search:report>
> <search:report id="SEARCH-FLWOR">(for $result in cts:search(/logfile/log, 
> cts:and-query((cts:word-query("INFO", ("unwildcarded","lang=en"), 1), 
> cts:directory-query("/logs/", "infinity")), ()), ("score-logtfidf"), 1) order 
> by (($result//log/@time)[1]) ascending return $result)[1 to 
> 10]</search:report>
> <search:metrics>
> <search:query-resolution-time>PT2M16.717897S</search:query-resolution-time>
> <search:facet-resolution-time>PT0.000086S</search:facet-resolution-time>
> <search:snippet-resolution-time>PT0S</search:snippet-resolution-time>
> <search:total-time>PT2M16.875346S</search:total-time>
> </search:metrics>
> </search:response>
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Micah Dubinko
> Sent: Monday, July 11, 2011 12:45 PM
> To: General MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Is "INFO" magic in search:search ? bad 
> performance
> 
> Another thing you should do is include <return-plan>true</return-plan> in the 
> options node, which will give a full readout of the query plan in the 
> response. Also <debug>true</debug> will provide some more details, including 
> the actual XQuery that implements the query. I'm guessing the word "info" 
> appears very often in the data, and resolving the query might be more work 
> than either of us suspects at this point.
> 
> -m
> 
> 
> 
> On Jul 11, 2011, at 9:06 AM, Michael Blakeley wrote:
> 
>> A call to xdmp:value() could be doing almost anything. But my guess is 
>> snippet generation and highlighting. Going further out on a limb, it may 
>> have something to do with XPath across fragment boundaries. If the snippet 
>> code tries to go up to the '/logfile' root element, it will be loading a 
>> fairly large fragment with many (millions of?) child links. That could get 
>> ugly, especially if the working set is too large for the CPU's on-die caches.
>> 
>> You could test that theory by turning off snippet display, and by creating 
>> some logfile documents of various sizes. If I'm on the right track, you'll 
>> see that elapsed time is related to the number of fragments. I don't know if 
>> it will be O(n) or something worse, though.
>> 
>> I would consider making '/log' the root element, with no subfragments. If 
>> you have metadata in the logfile, you could represent that with a directory 
>> structure under /logs/, and perhaps have a metadata document with a known 
>> base URI inside each directory. Or you could repeat the metadata in each 
>> log-entry document. But I would try to get away from using sub-document 
>> fragments.
>> 
>> -- Mike
>> 
>> On 11 Jul 2011, at 07:55 , Lee, David wrote:
>> 
>>> I'm playing with search:search.
>>> I have about 4 GB of data (a few million XML files).
>>> Most searches return within a second or two, but something magic is
>>> happening if I use the word "INFO".
>>> 
>>> 
>>> With the following search it takes nearly 3 minutes and returns no results.
>>> I tried other words that return results, no results, or few results, and I
>>> can only replicate it with the magic word "INFO".
>>> 
>>> search:search( "INFO" ,
>>>   <options xmlns="http://marklogic.com/appservices/search">
>>>     <additional-query>{cts:directory-query("/logs/","infinity")}
>>>     </additional-query>
>>>     <searchable-expression>/logfile/log</searchable-expression>
>>>   </options>
>>>   , 1 , 10 )
>>> 
>>> 
>>> Profiling shows the majority of time is spent in xdmp:value
>>> 
>>> Location:    MarkLogic/appservices/utils/higher-order.xqy: 52
>>> Expression:  xdmp:value($expr/hof:lambda/@expr)
>>> Count:       1
>>> Shallow %:   100
>>> Shallow us:  124556878
>>> Deep %:      100
>>> Deep us:     124556880
>>> 
>>> 
>>> That's about 124 seconds.  Everything else is noise (mostly under 100 us).
>>> 
>>> Any ideas ?
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> General mailing list
>>> [email protected]
>>> http://developer.marklogic.com/mailman/listinfo/general
>> 

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general