Susan,

I may not understand your use-case and document structure correctly, but doesn't this query have to access every available taxonomy document? I say this because it lists out the isbn13, booktitle, and authorsort elements, but only authorsort comes from a range index. The query as written is actually a little worse than that, because it calls cts:search once per author. If some documents have multiple authors, it will access each document multiple times.

In the end, this query probably calls cts:search at least 1000 times. That would account for the bulk of the elapsed time. On average that's less than 6-ms per cts:search, which really isn't bad for what it's doing. I think we can improve on that a little bit, but to get a dramatic speed-up we need to re-think the problem.

How fast does this query need to run? Could you run this query whenever the relevant documents are updated, and store the results in a new document? If so, 5.5-sec might be acceptable because it would only happen once, at the end of each batch of updates. Any user access to the stored version of the author-list would be much faster, since all the hard work has already been done.
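
Here is a rough, untested sketch of that caching idea. It runs your current query unchanged and stores the output with xdmp:document-insert. The URI '/cache/author-list.xml' and the author-list wrapper element are names I made up for illustration, so substitute whatever suits your application.

```xquery
xquery version "1.0-ml";

(: Hypothetical URI for the cached author list. :)
let $uri := '/cache/author-list.xml'
(: The existing query, unchanged: one heading element per author. :)
let $headings :=
  for $v in cts:element-values(
    xs:QName('authorsort'), (), (), cts:collection-query('abce'))
  return element heading {
    attribute type { 'author' },
    element name { $v },
    for $title in cts:search(
      collection('abce')//taxonomy,
      cts:element-value-query(xs:QName('authorsort'), $v),
      'unfiltered')
    return element title {
      attribute doc-id { $title/isbn13 },
      $title/booktitle } }
(: author-list is a made-up wrapper element name. :)
return xdmp:document-insert($uri, element author-list { $headings })
```

You would run this at the end of each batch of updates. The browse page then only needs doc('/cache/author-list.xml')/author-list/heading, which is a single document read.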

That would be my preferred solution. But while you are thinking that over, let's optimize the query a bit. In this case I don't think more range indexes will help: there would still be around a thousand trips through the database to build the page. Instead we can eliminate the extra document reads by making a single pass over all the taxonomy docs. We'll use a map to accumulate results in memory.

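(: Accumulate titles per author in memory: key = authorsort, value = title elements. :)
let $map := map:map()
let $build :=
  (: One pass over the documents. This assumes taxonomy is the root element;
     use collection('abce')//taxonomy if it is nested deeper. :)
  for $t in collection('abce')/taxonomy
  for $name as xs:string in $t/Authors/author/authorsort
  return map:put(
    $map, $name,
    (map:get($map, $name),
     element title { $t/isbn13, $t/booktitle } ) )
(: Emit one heading per author, sorted by name. :)
for $v in map:keys($map)
order by $v
return element heading {
  attribute type { 'author' },
  element name { $v },
  for $t in map:get($map, $v)
  return element title {
    attribute doc-id { $t/isbn13 },
    $t/booktitle } }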

That may look more complicated, but I think you'll find that it's faster - mostly because it's guaranteed to only touch each document once. I faked a test using some Medline documents, and this was about 5x faster than the baseline. It also doesn't require any range indexes.

Possibly it still isn't fast enough, and you've decided that caching the output in a new document won't work. If so, then I think you'll need to use co-occurrences. This gets tricky because you want three pieces of information, but co-occurrence only supports two elements at a time. But I think you can still manage it if you're willing to create a new element that combines title and isbn13 in a delimited string.
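
As a sketch of that combined element, the following one-time update adds it to each existing taxonomy. The element name isbn13-title and the comma delimiter are only suggestions, and I haven't run this against your data. I put isbn13 first so that commas inside a title can't confuse the split later.

```xquery
xquery version "1.0-ml";

(: Add a combined "isbn13,booktitle" element to every taxonomy
   that does not already have one. isbn13-title is a suggested name. :)
for $t in collection('abce')//taxonomy[not(isbn13-title)]
return xdmp:node-insert-child(
  $t,
  element isbn13-title {
    concat(string($t/isbn13), ',', string($t/booktitle)) })
```

Since the taxonomy is added before each title is loaded, it may be simpler to generate this element in that step instead. With a large collection you would also want to run an update like this in batches rather than in one transaction.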

Then you could build a range index on your new isbn13-title element (let's say it's comma-delimited) and use cts:element-value-co-occurrences() to build the entire result set without any document fetches.
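
A rough, untested sketch of the co-occurrence version follows. It assumes string range indexes on both authorsort and isbn13-title, that both values live in the same fragment, and that the combined value has the form "isbn13,booktitle". As I understand it, each result is a cts:co-occurrence element holding two cts:value children, in the order of the two QNames.

```xquery
xquery version "1.0-ml";

(: Group isbn13-title strings by author, using only the range indexes. :)
let $map := map:map()
let $build :=
  for $co in cts:element-value-co-occurrences(
    xs:QName('authorsort'), xs:QName('isbn13-title'),
    (), cts:collection-query('abce'))
  let $name := string($co/cts:value[1])
  return map:put(
    $map, $name,
    (map:get($map, $name), string($co/cts:value[2])))
for $v in map:keys($map)
order by $v
return element heading {
  attribute type { 'author' },
  element name { $v },
  for $pair in map:get($map, $v)
  return element title {
    (: Split on the first comma: isbn13 before it, title after it. :)
    attribute doc-id { substring-before($pair, ',') },
    element booktitle { substring-after($pair, ',') } } }
```

The output has the same shape as the map-based query above, but nothing is read from the documents themselves.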

-- Mike

On 2009-07-17 07:02, Susan Basch wrote:
Hi all,

Apologies in advance for the length of this email . . .

I'm trying to generate a list of unique authors with their associated titles from 
a taxonomy element that's added to each of our titles before it's loaded into 
Mark Logic.

The taxonomy looks something like this:

  <taxonomy>
  <booktitle>Daily Lives of Civilians in Wartime Twentieth-Century Europe</booktitle>
  <booktitle_sort>Daily Lives of Civilians in Wartime Twentieth-Century Europe</booktitle_sort>
  ...
  <Authors>
   <author authorId="130670">
    <authorsort>Atkin, Nicholas</authorsort>
    <firstname>Nicholas</firstname>
    <middlename /><lastname>Atkin</lastname>
    <role>Author</role><rank>1</rank>
   </author>
  </Authors>
  </taxonomy>

There's an element-range-index on the authorsort and booktitle_sort elements.  
There can be more than one author element.

And the query (so far) looks something like this:


  for $v in cts:element-values(
     xs:QName('authorsort'),
     (), (),
     cts:collection-query('abce'))

  return

   element heading {
       attribute type { 'author'},
       element name {$v},

       let $titles :=
         ( cts:search(collection("abce")//taxonomy,
             cts:element-value-query(xs:QName('authorsort'), $v),
             'unfiltered') )

           for $title in $titles
             return element title {
               attribute doc-id {$title/isbn13},
               $title/booktitle}
             }

This approach seems to work fairly well with the element range indexes on our 
subject and date taxonomy elements, but is just too slow when it comes to the 
authors.

Here's an excerpt from xdmp:query-trace:

xdmp:eval("(: browse testing :)&#13;&#10;xquery version&quot;1.0-ml&quot;;...", (),<options 
xmlns="xdmp:eval"><database>7839305530622276384</database><modules>0</modules><def...</options>)
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Analyzing path for search: 
collection("abce")/descendant::taxonomy
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 1 is searchable: 
collection("abce")
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 2 is searchable: 
descendant::taxonomy
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Path is fully searchable.
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Gathering constraints.
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 1 contributed 1 constraint: 
collection("abce")
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 2 test contributed 1 
constraint: taxonomy
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Comparison contributed string range 
value constraint: authorsort = "Zimmerman, Joseph F."
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Search query contributed 1 constraint: 
cts:element-range-query(QName("", "authorsort"), "=", "Zimmerman, Joseph F.", 
("collation=http://marklogic.com/collation/"), 1)
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Executing search.
2009-07-17 05:13:42.584 Info: 8002-research: line 48: Selected 4 fragments

<qm:query-meters xsi:schemaLocation="http://marklogic.com/xdmp/query-meters query-meters.xsd" 
xmlns:qm="http://marklogic.com/xdmp/query-meters" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <qm:elapsed-time>PT5.438S</qm:elapsed-time>
   <qm:requests>0</qm:requests>
   <qm:list-cache-hits>215013</qm:list-cache-hits>
   <qm:list-cache-misses>0</qm:list-cache-misses>
   <qm:in-memory-list-hits>0</qm:in-memory-list-hits>
   <qm:expanded-tree-cache-hits>2205</qm:expanded-tree-cache-hits>
   <qm:expanded-tree-cache-misses>4554</qm:expanded-tree-cache-misses>
   <qm:compressed-tree-cache-hits>4554</qm:compressed-tree-cache-hits>
   <qm:compressed-tree-cache-misses>0</qm:compressed-tree-cache-misses>
   <qm:in-memory-compressed-tree-hits>0</qm:in-memory-compressed-tree-hits>
   <qm:value-cache-hits>0</qm:value-cache-hits>
   <qm:value-cache-misses>0</qm:value-cache-misses>
   ...
   <qm:document>
   <qm:uri>/abce/C9129.xml</qm:uri>
   <qm:expanded-tree-cache-hits>0</qm:expanded-tree-cache-hits>
   <qm:expanded-tree-cache-misses>1</qm:expanded-tree-cache-misses>
   </qm:document>

I seem to be getting a lot of expanded-tree-cache-misses, but I'm not sure how 
to correct for that.

Is there a more efficient way to generate our list of authors?

Thanks!

Susan




