James,

I can run a similar query rather quickly: 230 ms for 500 MedlineCitation fragments, with my laptop in powersave mode. I believe that the output is equivalent.
<result>{
  for $key in distinct-values(
    for $author in collection()/MedlineCitationSet/MedlineCitation
      /Article/AuthorList/Author[ LastName ]
    return string-join(($author/LastName, $author/ForeName), "|") )
  (: tokenize takes a regular expression, so the pipe must be escaped :)
  let $name := tokenize($key, '\|')
  order by $key
  return <author>{
    element surname { $name[1] },
    element fname { $name[2] } }</author>
}</result>
In this new query, the dominant expression is the rooted XPath, bringing
the MedlineCitation fragments into memory, plus calculating the
distinct-values of the pipe-delimited key. That's fine for small numbers
of fragments, but this approach requires rapidly-increasing amounts of
memory with large content sets. That's likely to be an issue with Saxon,
too.
To scale up, it's better to use a range index of type string. We can access its values via cts:element-values() or cts:element-attribute-values(). This approach can deliver your answers in milliseconds.
For your desired output, though, this would involve some content enrichment as well: perhaps by adding a 'key' attribute on every Author element. The new "Corb" tool on http://developer.marklogic.com/code/ is a good resource for this sort of enrichment, and the example medline-iso8601.xqy module is very close to what you'd need (http://developer.marklogic.com/svn/corb/trunk/src/java/com/marklogic/developer/corb/medline-iso8601.xqy).
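As a rough sketch of the range-index approach, assuming the enrichment has already stored the same pipe-delimited "LastName|ForeName" value in a 'key' attribute on each Author element, and that a string attribute range index has been configured on Author/@key (both are assumptions, not something your current content has), the query might look like this:

```xquery
(: Sketch only. Assumes a string attribute range index on Author/@key,
   and that each @key holds "LastName|ForeName" from the enrichment step. :)
<result>{
  for $key in cts:element-attribute-values(
    xs:QName("Author"), xs:QName("key"))
  (: the pipe is a regex metacharacter, so it is escaped here :)
  let $name := tokenize($key, '\|')
  return <author>{
    element surname { $name[1] },
    element fname { $name[2] } }</author>
}</result>
```

The values come straight from the index and are already distinct, so no fragments need to be loaded. They should also come back in the index's collation order, which is why the sketch has no "order by", but check that against the sort order you expect.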
-- Mike

James A. Robinson wrote:
The question posed by Helen got me curious about the performance of
MarkLogic when it comes to dealing with trees built during the query.
I put together a set of 636 pubmed articles (so, a tiny set compared to
the amount Helen is dealing with). This set of articles contains 2847
author elements, 2634 of which are unique.
I used the XQuery I sent to the list and timed the results. The total
execution time was about 20 seconds! It looks like 19.7 or so of those
seconds are spent weeding out the unique authors -- it only takes the
server about .3 seconds to build the list of author names and the list
of unique author keys, and the rest of the time is spent looking at
the authors list for those unique author keys.
I timed this against Saxon, on the same machine, where Saxon loaded the
files up from disk. It takes Saxon about 1.5 seconds to load the
files, but the amount of time to actually execute the query was only
about .1 to .2 seconds (so under 2 seconds total execution time).
This struck me as odd; I wouldn't have expected this much of a difference.
Since MarkLogic Server was blazingly fast at actually loading the
documents (which makes sense, xdmp:query-meters() shows the documents were
read from cache), I assume the difference is that Saxon is much better
at building an index for the temporary tree -- Is MarkLogic not doing
anything similar? If not, is there a technique one can use to force it
to? Is there some other way one should approach a manipulation like this?
I ask because this seemed like a typical sort of problem one might need
to solve in XQuery (when the documents don't have quite as granular a
view as one needs, it seems reasonable to assume one should be able to
build up the granular representation as part of the query).
<result>{
  let $authors :=
    for $author in
      collection()/MedlineCitationSet/MedlineCitation/Article/AuthorList/Author
    let $surname := data($author/LastName)
    let $fname := data($author/FirstName)
    let $key := string-join(($surname,$fname), "|")
    where exists($author/LastName)
    return
      <author key="{$key}">{
        <surname>{$surname}</surname>,
        if ($fname ne '') then <fname>{$fname}</fname> else ()
      }</author>
  let $unique :=
    for $key in distinct-values($authors/@key)
    order by $key
    return $key
  return
    for $key in $unique
    return <author>[EMAIL PROTECTED]/*}</author> (: the dreadfully slow part ... :)
}</result>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
James A. Robinson [EMAIL PROTECTED]
Stanford University HighWire Press http://highwire.stanford.edu/
+1 650 7237294 (Work) +1 650 7259335 (Fax)
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
