Perfect! The co-occurrences approach with a new combined element is just what we needed -- thanks for pointing me in the right direction.
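
For the archives, a minimal sketch of the co-occurrences approach. It is
untested, and it rests on several assumptions that are not confirmed in
this thread: a hypothetical element named isbn13-title holding
"isbn13,booktitle" has been added to each taxonomy, string range indexes
exist on both authorsort and isbn13-title, and each taxonomy sits in a
single fragment. The map pattern is borrowed from Mike's example below.

```xquery
xquery version "1.0-ml";

(: Sketch only. "isbn13-title" is a hypothetical element name; its value
   is assumed to be "<isbn13>,<booktitle>". Both elements need a string
   range index for cts:element-value-co-occurrences to work. :)
let $map := map:map()
let $build :=
  for $co in cts:element-value-co-occurrences(
    xs:QName('authorsort'),
    xs:QName('isbn13-title'),
    (),
    cts:collection-query('abce'))
  (: each cts:co-occurrence holds two cts:value children :)
  let $name := string($co/cts:value[1])
  let $pair := string($co/cts:value[2])
  return map:put($map, $name, (map:get($map, $name), $pair))
for $name in map:keys($map)
order by $name
return element heading {
  attribute type { 'author' },
  element name { $name },
  for $pair in map:get($map, $name)
  return element title {
    (: split on the first comma only, so commas in titles survive :)
    attribute doc-id { substring-before($pair, ',') },
    element booktitle { substring-after($pair, ',') }
  }
}
```

Everything here should come from the range indexes, so no taxonomy
documents are fetched. Splitting on the first comma relies on isbn13
values never containing a comma.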
On Fri, Jul 17, 2009 at 1:41 PM, Michael Blakeley < [email protected]> wrote:

> Susan,
>
> I may not understand your use-case and document structure correctly, but
> doesn't this query have to access every available taxonomy document? I say
> this because it lists out the isbn13, booktitle, and authorsort elements,
> but only authorsort comes from a range index. The query as written is
> actually a little worse than that, because it calls cts:search once per
> author. If some documents have multiple authors, it will access each
> document multiple times.
>
> In the end, this query probably calls cts:search at least 1000 times. That
> would account for the bulk of the elapsed time. On average that's less than
> 6-ms per cts:search, which really isn't bad for what it's doing. I think we
> can improve on that a little bit, but to get a dramatic speed-up we need to
> re-think the problem.
>
> How fast does this query need to run? Could you run this query whenever the
> relevant documents are updated, and store the results in a new document? If
> so, 5.5-sec might be acceptable because it would only happen once, at the
> end of each batch of updates. Any user access to the stored version of the
> author-list would be much faster, since all the hard work has already been
> done.
>
> That would be my preferred solution. But while you are thinking that over,
> let's optimize the query a bit. In this case I don't think more range
> indexes will help: there would still be around a thousand trips through the
> database to build the page. Instead we can eliminate the extra document
> reads by making a single pass over all the taxonomy docs. We'll use a map to
> accumulate results in memory.
>
> let $map := map:map()
> let $build :=
>   for $t in collection('abce')/taxonomy
>   for $name as xs:string in $t/Authors/author/authorsort
>   return map:put(
>     $map, $name,
>     (map:get($map, $name),
>      element title { $t/isbn13, $t/booktitle } ) )
> for $v in map:keys($map)
> order by $v
> return element heading {
>   attribute type { 'author'},
>   element name { $v },
>   for $t in map:get($map, $v)
>   return element title {
>     attribute doc-id {$t/isbn13},
>     $t/booktitle } }
>
> That may look more complicated, but I think you'll find that it's faster -
> mostly because it's guaranteed to only touch each document once. I faked a
> test using some Medline documents, and this was about 5x faster than the
> baseline. It also doesn't require any range indexes.
>
> Possibly it still isn't fast enough, and you've decided that caching the
> output in a new document won't work. If so, then I think you'll need to use
> co-occurrences. This gets tricky because you want three pieces of
> information, but co-occurrence only supports two elements at a time. But I
> think you can still manage it if you're willing to create a new element that
> combines title and isbn13 in a delimited string.
>
> Then you could build a range index on your new isbn13-title element
> (let's say it's comma-delimited) and use cts:element-value-co-occurrences()
> to build the entire result set without any document fetches.
>
> -- Mike
>
>
> On 2009-07-17 07:02, Susan Basch wrote:
>
>> Hi all,
>>
>> Apologies in advance for the length of this email . . .
>>
>> I'm trying to generate a list of unique authors with their associated titles
>> from a taxonomy element that's added to each of our titles before it's
>> loaded into Mark Logic.
>>
>> The taxonomy looks something like this:
>>
>> <taxonomy>
>>   <booktitle>Daily Lives of Civilians in Wartime Twentieth-Century
>> Europe</booktitle>
>>   <booktitle_sort>Daily Lives of Civilians in Wartime Twentieth-Century
>> Europe</booktitle_sort>
>>   ...
>>   <Authors>
>>     <author authorId="130670">
>>       <authorsort>Atkin, Nicholas</authorsort>
>>       <firstname>Nicholas</firstname>
>>       <middlename /><lastname>Atkin</lastname>
>>       <role>Author</role><rank>1</rank>
>>     </author>
>>   </Authors>
>> </taxonomy>
>>
>> There's an element-range-index on the authorsort and booktitle_sort
>> elements. There can be more than one author element.
>>
>> And the query (so far) looks something like this:
>>
>>
>> for $v in cts:element-values(
>>   xs:QName('authorsort'),
>>   (), (),
>>   cts:collection-query('abce'))
>>
>> return
>>
>> element heading {
>>   attribute type { 'author'},
>>   element name {$v},
>>
>>   let $titles :=
>>     ( cts:search(collection("abce")//taxonomy,
>>       cts:element-value-query(xs:QName('authorsort'), $v), 'unfiltered') )
>>
>>   for $title in $titles
>>   return element title {
>>     attribute doc-id {$title/isbn13},
>>     $title/booktitle}
>> }
>>
>> This approach seems to work fairly well with the element range indexes on
>> our subject and date taxonomy elements, but is just too slow when it comes
>> to the authors.
>>
>> Here's an excerpt from xdmp:query-trace:
>>
>> xdmp:eval("(: browse testing :) xquery
>> version"1.0-ml";...", (),<options
>> xmlns="xdmp:eval"><database>7839305530622276384</database><modules>0</modules><def...</options>)
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Analyzing path for
>> search: collection("abce")/descendant::taxonomy
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 1 is
>> searchable: collection("abce")
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 2 is
>> searchable: descendant::taxonomy
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Path is fully
>> searchable.
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Gathering
>> constraints.
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 1 contributed 1
>> constraint: collection("abce")
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Step 2 test
>> contributed 1 constraint: taxonomy
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Comparison
>> contributed string range value constraint: authorsort = "Zimmerman, Joseph
>> F."
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Search query
>> contributed 1 constraint: cts:element-range-query(QName("", "authorsort"),
>> "=", "Zimmerman, Joseph F.", ("collation=http://marklogic.com/collation/"),
>> 1)
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Executing search.
>> 2009-07-17 05:13:42.584 Info: 8002-research: line 48: Selected 4 fragments
>>
>> <qm:query-meters xsi:schemaLocation="
>> http://marklogic.com/xdmp/query-meters query-meters.xsd" xmlns:qm="
>> http://marklogic.com/xdmp/query-meters" xmlns:xsi="
>> http://www.w3.org/2001/XMLSchema-instance">
>>   <qm:elapsed-time>PT5.438S</qm:elapsed-time>
>>   <qm:requests>0</qm:requests>
>>   <qm:list-cache-hits>215013</qm:list-cache-hits>
>>   <qm:list-cache-misses>0</qm:list-cache-misses>
>>   <qm:in-memory-list-hits>0</qm:in-memory-list-hits>
>>   <qm:expanded-tree-cache-hits>2205</qm:expanded-tree-cache-hits>
>>   <qm:expanded-tree-cache-misses>4554</qm:expanded-tree-cache-misses>
>>   <qm:compressed-tree-cache-hits>4554</qm:compressed-tree-cache-hits>
>>   <qm:compressed-tree-cache-misses>0</qm:compressed-tree-cache-misses>
>>   <qm:in-memory-compressed-tree-hits>0</qm:in-memory-compressed-tree-hits>
>>   <qm:value-cache-hits>0</qm:value-cache-hits>
>>   <qm:value-cache-misses>0</qm:value-cache-misses>
>>   ...
>>   <qm:document>
>>     <qm:uri>/abce/C9129.xml</qm:uri>
>>     <qm:expanded-tree-cache-hits>0</qm:expanded-tree-cache-hits>
>>     <qm:expanded-tree-cache-misses>1</qm:expanded-tree-cache-misses>
>>   </qm:document>
>>
>> I seem to be getting a lot of expanded-tree-cache-misses, but I'm not sure
>> how to correct for that.
>>
>> Is there a more efficient way to generate our list of authors?
>>
>> Thanks!
>>
>> Susan
>>
> _______________________________________________
> General mailing list
> [email protected]
> http://xqzone.com/mailman/listinfo/general
>
