Re: [MarkLogic Dev General] query tuning advice

Michael Blakeley Fri, 17 Jul 2009 10:41:52 -0700

Susan,

I may not understand your use-case and document structure correctly, but doesn't this query have to access every available taxonomy document? I say this because it lists out the isbn13, booktitle, and authorsort elements, but only authorsort comes from a range index. The query as written is actually a little worse than that, because it calls cts:search once per author. If some documents have multiple authors, it will access each document multiple times.

In the end, this query probably calls cts:search at least 1000 times. That would account for the bulk of the elapsed time. On average that's less than 6 ms per cts:search, which really isn't bad for what it's doing. I think we can improve on that a little bit, but to get a dramatic speed-up we need to re-think the problem.
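
One way to check that estimate is to count the distinct authorsort values, since the query runs one cts:search per value. This is only a sketch: it assumes the same authorsort range index and 'abce' collection as the original query, and the 5.438-second figure is simply copied from the query-meters output quoted below.

```xquery
xquery version "1.0-ml";

(: One cts:search runs per distinct authorsort value, so this count
   is the number of searches the original query performs. :)
let $n := count(
  cts:element-values(
    xs:QName('authorsort'), (), (),
    cts:collection-query('abce')))
return (
  $n,
  (: Average time per search, using the elapsed time from query-meters.
     Guarded so an empty collection does not divide by zero. :)
  if ($n gt 0)
  then xs:dayTimeDuration('PT5.438S') div $n
  else ()
)
```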

How fast does this query need to run? Could you run this query whenever the relevant documents are updated, and store the results in a new document? If so, 5.5 sec might be acceptable because it would only happen once, at the end of each batch of updates. Any user access to the stored version of the author-list would be much faster, since all the hard work has already been done.
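
A minimal sketch of that caching idea follows. The cache URI, the author-list wrapper element, and the local:author-headings function name are all hypothetical; the function body is just the original query in compact form, standing in for whichever author-list query ends up being used.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI for the cached list; choose whatever fits your layout. :)
declare variable $CACHE-URI as xs:string := '/cache/abce-author-list.xml';

(: Stand-in for the author-list query: returns one heading per author. :)
declare function local:author-headings() as element(heading)*
{
  for $v in cts:element-values(
    xs:QName('authorsort'), (), (),
    cts:collection-query('abce'))
  return element heading {
    attribute type { 'author' },
    element name { $v },
    for $t in cts:search(
      collection('abce')//taxonomy,
      cts:element-value-query(xs:QName('authorsort'), $v),
      'unfiltered')
    return element title {
      attribute doc-id { $t/isbn13 },
      $t/booktitle } }
};

(: Run once at the end of each batch of updates.
   The cached document is not added to the 'abce' collection,
   so it does not show up in later collection('abce') queries. :)
xdmp:document-insert(
  $CACHE-URI,
  element author-list { local:author-headings() })
```

A page that needs the list would then read doc('/cache/abce-author-list.xml')/author-list/heading, which is a single document fetch.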

That would be my preferred solution. But while you are thinking that over, let's optimize the query a bit. In this case I don't think more range indexes will help: there would still be around a thousand trips through the database to build the page. Instead we can eliminate the extra document reads by making a single pass over all the taxonomy docs. We'll use a map to accumulate results in memory.


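(: $map: authorsort string -> sequence of title elements :)
let $map := map:map()
(: single pass: each taxonomy document is read exactly once :)
let $build :=
  for $t in collection('abce')/taxonomy
  for $name as xs:string in $t/Authors/author/authorsort
  return map:put(
    $map, $name,
    (: append this title to whatever is already stored for the author :)
    (map:get($map, $name),
     element title { $t/isbn13, $t/booktitle } ) )
(: emit one heading per author, sorted by name :)
for $v in map:keys($map)
order by $v
return element heading {
  attribute type { 'author'},
  element name { $v },
  for $t in map:get($map, $v)
  return element title {
    attribute doc-id {$t/isbn13},
    $t/booktitle } }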

That may look more complicated, but I think you'll find that it's faster, mostly because it's guaranteed to only touch each document once. I faked a test using some Medline documents, and this was about 5x faster than the baseline. It also doesn't require any range indexes.
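
To compare candidates on your own data, a small timing harness along these lines may help. It is a sketch, not the test described above: local:query-under-test is a hypothetical name, and its body is a placeholder to be replaced with the query being measured.

```xquery
xquery version "1.0-ml";

declare function local:query-under-test() as item()*
{
  (: Placeholder: paste the candidate query here.
     This stand-in only lists the distinct authors. :)
  cts:element-values(
    xs:QName('authorsort'), (), (),
    cts:collection-query('abce'))
};

let $results := local:query-under-test()
return (
  (: count() forces the results to be fully evaluated :)
  count($results),
  (: time elapsed since this request started :)
  xdmp:elapsed-time(),
  (: cache hit/miss counters, as in the query-meters excerpt below :)
  xdmp:query-meters()
)
```

Run each candidate a few times; the first run is likely to include cache misses that later runs will not.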

Possibly it still isn't fast enough, and you've decided that caching the output in a new document won't work. If so, then I think you'll need to use co-occurrences. This gets tricky because you want three pieces of information, but co-occurrence only supports two elements at a time. But I think you can still manage it if you're willing to create a new element that combines title and isbn13 in a delimited string.
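
As a rough sketch, the combined element could be added to documents that are already loaded like this. The element name isbn13-title and the comma delimiter are assumptions for illustration, and the code assumes each taxonomy has exactly one isbn13 and one booktitle. For new content it would be simpler to add the element before loading.

```xquery
xquery version "1.0-ml";

(: Add <isbn13-title>isbn13,booktitle</isbn13-title> to every taxonomy
   that does not have one yet. The isbn13 comes first so that the first
   comma always marks the split, even when the title contains commas. :)
for $t in collection('abce')/taxonomy[not(isbn13-title)]
return xdmp:node-insert-child(
  $t,
  element isbn13-title {
    concat(
      string($t/isbn13), ',',
      normalize-space(string($t/booktitle))) })
```

This updates every matching document in one transaction, so for a large collection it may be better to run it in batches.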

Then you could build a range index on your new isbn13-title element (let's say it's comma-delimited) and use cts:element-value-co-occurrences() to build the entire result set without any document fetches.
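
A sketch of how that might look, untested against real data. It assumes string range indexes exist on both authorsort and the new isbn13-title element (with the same collation), and that isbn13-title holds "isbn13,booktitle" with the isbn13 first.

```xquery
xquery version "1.0-ml";

(: Each co-occurrence pairs an authorsort value with an isbn13-title
   value from the same fragment. Group the pairs by author in a map. :)
let $map := map:map()
let $build :=
  for $co in cts:element-value-co-occurrences(
    xs:QName('authorsort'), xs:QName('isbn13-title'),
    (), cts:collection-query('abce'))
  let $author := string($co/cts:value[1])
  let $pair := string($co/cts:value[2])
  return map:put($map, $author, (map:get($map, $author), $pair))
for $author in map:keys($map)
order by $author
return element heading {
  attribute type { 'author' },
  element name { $author },
  for $pair in map:get($map, $author)
  return element title {
    (: split on the first comma: isbn13 before it, title after it :)
    attribute doc-id { substring-before($pair, ',') },
    element booktitle { substring-after($pair, ',') } } }
```

Everything here comes from the range indexes, so no taxonomy documents should need to be read to build the page.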


-- Mike

On 2009-07-17 07:02, Susan Basch wrote:

Hi all,

Apologies in advance for the length of this email . . .

I'm trying to generate a list of unique authors with their associated titles from 
a taxonomy element that's added to each of our titles before it's loaded into 
Mark Logic.

The taxonomy looks something like this:

  <taxonomy>
  <booktitle>Daily Lives of Civilians in Wartime Twentieth-Century 
Europe</booktitle>
  <booktitle_sort>Daily Lives of Civilians in Wartime Twentieth-Century 
Europe</booktitle_sort>
  ...
  <Authors>
   <author authorId="130670">
    <authorsort>Atkin, Nicholas</authorsort>
    <firstname>Nicholas</firstname>
    <middlename /><lastname>Atkin</lastname>
    <role>Author</role><rank>1</rank>
   </author>
  </Authors>
  </taxonomy>

There's an element-range-index on the authorsort and booktitle_sort elements.  
There can be more than one author element.

And the query (so far) looks something like this:


  for $v in cts:element-values(
     xs:QName('authorsort'),
     (), (),
     cts:collection-query('abce'))

  return

   element heading {
       attribute type { 'author'},
       element name {$v},

       let $titles :=
         ( cts:search(collection("abce")//taxonomy,
             cts:element-value-query(xs:QName('authorsort'), $v),
             'unfiltered') )

           for $title in $titles
             return element title {
               attribute doc-id {$title/isbn13},
               $title/booktitle}
             }

This approach seems to work fairly well with the element range indexes on our 
subject and date taxonomy elements, but is just too slow when it comes to the 
authors.

Here's an excerpt from xdmp:query-trace:

xdmp:eval("(: browse testing :)&#13;&#10;xquery version&quot;1.0-ml&quot;;...", (),<options 
xmlns="xdmp:eval"><database>7839305530622276384</database><modules>0</modules><def...</options>)
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Analyzing path for search: 
collection("abce")/descendant::taxonomy
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 1 is searchable: 
collection("abce")
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 2 is searchable: 
descendant::taxonomy
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Path is fully searchable.
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Gathering constraints.
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 1 contributed 1 constraint: 
collection("abce")
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 2 test contributed 1 
constraint: taxonomy
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Comparison contributed string range 
value constraint: authorsort = "Zimmerman, Joseph F."
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Search query contributed 1 constraint: 
cts:element-range-query(QName("", "authorsort"), "=", "Zimmerman, Joseph F.", 
("collation=http://marklogic.com/collation/"), 1)
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Executing search.
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Selected 4 fragments

<qm:query-meters xsi:schemaLocation="http://marklogic.com/xdmp/query-meters query-meters.xsd" 
xmlns:qm="http://marklogic.com/xdmp/query-meters" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <qm:elapsed-time>PT5.438S</qm:elapsed-time>
   <qm:requests>0</qm:requests>
   <qm:list-cache-hits>215013</qm:list-cache-hits>
   <qm:list-cache-misses>0</qm:list-cache-misses>
   <qm:in-memory-list-hits>0</qm:in-memory-list-hits>
   <qm:expanded-tree-cache-hits>2205</qm:expanded-tree-cache-hits>
   <qm:expanded-tree-cache-misses>4554</qm:expanded-tree-cache-misses>
   <qm:compressed-tree-cache-hits>4554</qm:compressed-tree-cache-hits>
   <qm:compressed-tree-cache-misses>0</qm:compressed-tree-cache-misses>
   <qm:in-memory-compressed-tree-hits>0</qm:in-memory-compressed-tree-hits>
   <qm:value-cache-hits>0</qm:value-cache-hits>
   <qm:value-cache-misses>0</qm:value-cache-misses>
   ...
   <qm:document>
   <qm:uri>/abce/C9129.xml</qm:uri>
   <qm:expanded-tree-cache-hits>0</qm:expanded-tree-cache-hits>
   <qm:expanded-tree-cache-misses>1</qm:expanded-tree-cache-misses>
   </qm:document>

I seem to be getting a lot of expanded-tree-cache-misses, but I'm not sure how 
to correct for that.

Is there a more efficient way to generate our list of authors?

Thanks!

Susan

