Hi Johan,
Your approach will do n+1 index lookups, where n is the number of documents you
have. It will be more efficient to do just one co-occurrence lookup, and then
do the calculation in memory. Here is a bit of code that generates sample docs,
does a count the classical way (for comparison), and then shows an approach
that leverages a range index on file/@size:
xquery version "1.0-ml";

for $i in (1 to 100)
let $xml :=
  <mets id="{$i}">{
    for $j in (1 to xdmp:random($i))
    return <file size="{$j}"/>
  }</mets>
return
  xdmp:document-insert("/mets/file" || $i || ".xml", $xml, (), "mets")
;
(
  for $doc in collection()
  let $size-count := count($doc//@size)
  order by $size-count descending
  return xdmp:node-uri($doc) || " - " || $size-count
)[1 to 10]
;
"--------"
;
let $uris := map:map()
let $_ :=
  for $tuple in
    cts:value-tuples((
      cts:uri-reference(),
      cts:element-attribute-reference(xs:QName("file"), xs:QName("size"))
    ))
  let $uri := $tuple[1]
  let $freq := cts:frequency($tuple)
  return map:put($uris, $uri, sum((map:get($uris, $uri), $freq)))
(: unary minus inverts the map: the summed counts become keys, uris become values :)
let $counts := -$uris
let $top-keys :=
  for $key in map:keys($counts)
  order by xs:int($key) descending
  return $key
return (
  for $key in $top-keys
  for $value in map:get($counts, $key)
  return $value || " - " || $key
)[1 to 10]
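To make the in-memory step explicit: the tuple stream is conceptually a list of
(uri, size value, frequency) rows, and we just sum frequencies per URI and sort
descending. A minimal Python sketch of that aggregation, using made-up sample
tuples since the real rows come from the MarkLogic lexicons:

```python
# Hypothetical stand-ins for cts:value-tuples output:
# (document uri, @size value, frequency of that uri/value pair).
tuples = [
    ("/mets/file1.xml", 1, 1),
    ("/mets/file2.xml", 1, 1),
    ("/mets/file2.xml", 2, 1),
    ("/mets/file3.xml", 1, 1),
    ("/mets/file3.xml", 2, 1),
    ("/mets/file3.xml", 3, 1),
]

counts = {}
for uri, size, freq in tuples:
    # Sum frequencies per document; each row contributes its frequency,
    # regardless of the actual @size value.
    counts[uri] = counts.get(uri, 0) + freq

# Top documents by number of @size occurrences, like the [1 to 10] slice.
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
for uri, count in top:
    print(uri, "-", count)
```

The map:map in the XQuery above plays the role of the dict here; the unary
minus inversion plus descending key sort achieves the same ordering as sorted().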
I don’t have enough volume to properly test performance, but I’d expect the
cts:value-tuples approach to run within seconds.
Cheers,
Geert
From: Johan Mörén <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Friday, June 26, 2015 at 12:07 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: [MarkLogic Dev General] Find the document(s) with max occurrences of
an element-attribute reference
Hi!
I want to find out which document(s) have the highest number of occurrences of
a particular element-attribute, regardless of its value. In this case:
//mets:file/@SIZE
I have a range index on this, and operations like getting the count, sum, avg,
min and max work fine and are very performant. But they operate on the whole
database, or on a subset constrained by another cts:query.
The included code gives me what I want, but it is not very performant: running
this query on a database with 300k documents returns in 1 minute and 25
seconds.
Is there a better way to solve this apart from including the count of
mets:file/@SIZE as a separate element in each document?
--- code ---
xquery version "1.0-ml";

declare namespace mets = "http://www.loc.gov/METS/";

declare variable $sizeRef :=
  cts:element-attribute-reference(xs:QName("mets:file"), xs:QName("SIZE"));

declare function local:updateMax($theMap, $size, $uri) {
  let $currentMax := (map:get($theMap, "max"), 0)[1]
  let $currentUris := map:get($theMap, "uris")
  return
    if ($size > $currentMax) then (
      map:put($theMap, "max", $size),
      map:put($theMap, "uris", $uri)
    )
    else if ($size = $currentMax) then
      map:put($theMap, "uris", ($currentUris, $uri))
    else ()
};

let $map := map:new((
  map:entry("max", 0),
  map:entry("uris", ())
))
let $puts :=
  for $uri in cts:uris()
  return local:updateMax(
    $map,
    cts:count-aggregate($sizeRef, (), cts:document-query($uri)),
    $uri
  )
return $map
--- end code ---
Regards,
Johan Mörén
National Library of Sweden
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general