Perfect!  The co-occurrences approach with a new combined element is just
what we needed -- thanks for pointing me in the right direction.


On Fri, Jul 17, 2009 at 1:41 PM, Michael Blakeley <
[email protected]> wrote:

> Susan,
>
> I may not understand your use-case and document structure correctly, but
> doesn't this query have to access every available taxonomy document? I say
> this because it lists out the isbn13, booktitle, and authorsort elements,
> but only authorsort comes from a range index. The query as written is
> actually a little worse than that, because it calls cts:search once per
> author. If some documents have multiple authors, it will access each
> document multiple times.
>
> In the end, this query probably calls cts:search at least 1000 times. That
> would account for the bulk of the elapsed time. At 5.438 seconds elapsed,
> that averages out to less than 6 ms per cts:search, which really isn't bad
> for what it's doing. I think we can improve on that a little bit, but to get
> a dramatic speed-up we need to re-think the problem.
>
> How fast does this query need to run? Could you run this query whenever the
> relevant documents are updated, and store the results in a new document? If
> so, 5.5 seconds might be acceptable because it would only happen once, at the
> end of each batch of updates. Any user access to the stored version of the
> author-list would be much faster, since all the hard work has already been
> done.
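>
> A rough sketch of that caching step, in case it helps. The cache URI and the
> local: function name are made up for illustration, and $headings stands for
> the output of whatever author-list query you settle on:
>
> ```xquery
> xquery version "1.0-ml";
>
> (: Hypothetical URI for the stored author list; not part of your data. :)
> declare variable $CACHE-URI as xs:string := "/cache/abce-author-list.xml";
>
> (: Call once at the end of each batch of updates, passing the heading
>    elements produced by the author-list query. :)
> declare function local:store-author-list(
>   $headings as element(heading)*) as empty-sequence()
> {
>   xdmp:document-insert($CACHE-URI, element author-list { $headings })
> };
>
> (: Placeholder call: replace () with the real query output. :)
> local:store-author-list(())
> ```
>
> Pages that show the list would then read the stored document with doc() at
> the same URI, which costs one document fetch instead of a thousand searches.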
>
> That would be my preferred solution. But while you are thinking that over,
> let's optimize the query a bit. In this case I don't think more range
> indexes will help: there would still be around a thousand trips through the
> database to build the page. Instead we can eliminate the extra document
> reads by making a single pass over all the taxonomy docs. We'll use a map to
> accumulate results in memory.
>
> let $map := map:map()
> let $build :=
>  for $t in collection('abce')/taxonomy
>  for $name as xs:string in $t/Authors/author/authorsort
>  return map:put(
>    $map, $name,
>    (map:get($map, $name),
>     element title { $t/isbn13, $t/booktitle } ) )
> for $v in map:keys($map)
> order by $v
> return element heading {
>  attribute type { 'author'},
>  element name { $v },
>  for $t in map:get($map, $v)
>  return element title {
>    attribute doc-id {$t/isbn13},
>    $t/booktitle } }
>
> That may look more complicated, but I think you'll find that it's faster -
> mostly because it's guaranteed to only touch each document once. I faked a
> test using some Medline documents, and this was about 5x faster than the
> baseline. It also doesn't require any range indexes.
>
> Possibly it still isn't fast enough, and you've decided that caching the
> output in a new document won't work. If so, then I think you'll need to use
> co-occurrences. This gets tricky because you want three pieces of
> information, but co-occurrence only supports two elements at a time. But I
> think you can still manage it if you're willing to create a new element that
> combines title and isbn13 in a delimited string.
>
> Then you could build a range index on your new isbn13-title element
> (let's say it's comma-delimited) and use cts:element-value-co-occurrences()
> to build the entire result set without any document fetches.
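>
> Untested, but the shape would be something like the sketch below. I'm
> assuming the combined element is named isbn13-title, holds the string
> "isbn13,booktitle", and has a string range index just like authorsort.
> Adjust the names to whatever you actually create:
>
> ```xquery
> xquery version "1.0-ml";
>
> let $map := map:map()
> let $build :=
>   (: each co-occurrence has two cts:value children:
>      the author, then the combined isbn13,title string :)
>   for $co in cts:element-value-co-occurrences(
>     xs:QName('authorsort'),
>     xs:QName('isbn13-title'),   (: hypothetical combined element :)
>     (),
>     cts:collection-query('abce'))
>   let $author := string($co/cts:value[1])
>   return map:put(
>     $map, $author,
>     (map:get($map, $author), string($co/cts:value[2])))
> for $author in map:keys($map)
> order by $author
> return element heading {
>   attribute type { 'author' },
>   element name { $author },
>   for $combined in map:get($map, $author)
>   return element title {
>     (: split at the first comma: an isbn13 has none, a title might :)
>     attribute doc-id { substring-before($combined, ',') },
>     element booktitle { substring-after($combined, ',') } } }
> ```
>
> Putting isbn13 first in the combined string is deliberate: it lets you split
> on the first comma even when the title itself contains commas.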
>
> -- Mike
>
>
> On 2009-07-17 07:02, Susan Basch wrote:
>
>> Hi all,
>>
>> Apologies in advance for the length of this email . . .
>>
>> I'm trying to generate a list of unique authors with their associated titles
>> from a taxonomy element that's added to each of our titles before it's
>> loaded into Mark Logic.
>>
>> The taxonomy looks something like this:
>>
>>  <taxonomy>
>>  <booktitle>Daily Lives of Civilians in Wartime Twentieth-Century
>> Europe</booktitle>
>>  <booktitle_sort>Daily Lives of Civilians in Wartime Twentieth-Century
>> Europe</booktitle_sort>
>>  ...
>>  <Authors>
>>   <author authorId="130670">
>>    <authorsort>Atkin, Nicholas</authorsort>
>>    <firstname>Nicholas</firstname>
>>    <middlename /><lastname>Atkin</lastname>
>>    <role>Author</role><rank>1</rank>
>>   </author>
>>  </Authors>
>>  </taxonomy>
>>
>> There's an element-range-index on the authorsort and booktitle_sort
>> elements.  There can be more than one author element.
>>
>> And the query (so far) looks something like this:
>>
>>
>>  for $v in cts:element-values(
>>     xs:QName('authorsort'),
>>     (), (),
>>     cts:collection-query('abce'))
>>  return
>>   element heading {
>>     attribute type { 'author' },
>>     element name { $v },
>>     let $titles :=
>>       cts:search(
>>         collection("abce")//taxonomy,
>>         cts:element-value-query(xs:QName('authorsort'), $v),
>>         'unfiltered')
>>     for $title in $titles
>>     return element title {
>>       attribute doc-id { $title/isbn13 },
>>       $title/booktitle }
>>   }
>>
>> This approach seems to work fairly well with the element range indexes on
>> our subject and date taxonomy elements, but is just too slow when it comes
>> to the authors.
>>
>> Here's an excerpt from xdmp:query-trace:
>>
>> xdmp:eval("(: browse testing :)&#13;&#10;xquery
>> version&quot;1.0-ml&quot;;...", (),<options
>> xmlns="xdmp:eval"><database>7839305530622276384</database><modules>0</modules><def...</options>)
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Analyzing path for
>> search: collection("abce")/descendant::taxonomy
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 1 is
>> searchable: collection("abce")
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 2 is
>> searchable: descendant::taxonomy
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Path is fully
>> searchable.
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Gathering
>> constraints.
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 1 contributed 1
>> constraint: collection("abce")
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 2 test
>> contributed 1 constraint: taxonomy
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Comparison
>> contributed string range value constraint: authorsort = "Zimmerman, Joseph
>> F."
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Search query
>> contributed 1 constraint: cts:element-range-query(QName("", "authorsort"),
>> "=", "Zimmerman, Joseph F.", ("collation=http://marklogic.com/collation/"),
>> 1)
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Executing search.
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Selected 4 fragments
>>
>> <qm:query-meters xsi:schemaLocation="
>> http://marklogic.com/xdmp/query-meters query-meters.xsd" xmlns:qm="
>> http://marklogic.com/xdmp/query-meters" xmlns:xsi="
>> http://www.w3.org/2001/XMLSchema-instance">
>>   <qm:elapsed-time>PT5.438S</qm:elapsed-time>
>>   <qm:requests>0</qm:requests>
>>   <qm:list-cache-hits>215013</qm:list-cache-hits>
>>   <qm:list-cache-misses>0</qm:list-cache-misses>
>>   <qm:in-memory-list-hits>0</qm:in-memory-list-hits>
>>   <qm:expanded-tree-cache-hits>2205</qm:expanded-tree-cache-hits>
>>   <qm:expanded-tree-cache-misses>4554</qm:expanded-tree-cache-misses>
>>   <qm:compressed-tree-cache-hits>4554</qm:compressed-tree-cache-hits>
>>   <qm:compressed-tree-cache-misses>0</qm:compressed-tree-cache-misses>
>>   <qm:in-memory-compressed-tree-hits>0</qm:in-memory-compressed-tree-hits>
>>   <qm:value-cache-hits>0</qm:value-cache-hits>
>>   <qm:value-cache-misses>0</qm:value-cache-misses>
>>   ...
>>   <qm:document>
>>   <qm:uri>/abce/C9129.xml</qm:uri>
>>   <qm:expanded-tree-cache-hits>0</qm:expanded-tree-cache-hits>
>>   <qm:expanded-tree-cache-misses>1</qm:expanded-tree-cache-misses>
>>   </qm:document>
>>
>> I seem to be getting a lot of expanded-tree-cache-misses, but I'm not sure
>> how to correct for that.
>>
>> Is there a more efficient way to generate our list of authors?
>>
>> Thanks!
>>
>> Susan
>>
>>
>>
>>
>>
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general
>
