I ran some profiles as well, and I think your profile results could have been a bit misleading. I suspect that walking over the URIs and summing frequencies is one of the major slow parts. I managed to create a UDF though (my first one). It is relatively generic, and should allow summing frequencies of any element/attribute. You can download it from here:

https://github.com/grtjn/doc-count-udf

Get/clone it, run `make` (preferably on the target environment), and follow the instructions to install it. After that you can run:

    let $uris :=
      cts:aggregate(
        "gjosten/doc-count",
        "doc-count",
        (
          cts:uri-reference(),
          cts:element-attribute-reference(xs:QName("file"), xs:QName("size"))
        )
      )
    (: unary minus inverts the map, so the counts become the keys :)
    let $counts := -$uris
    let $top-keys :=
      for $key in map:keys($counts)
      order by xs:int($key) descending
      return $key
    return (
      for $key in $top-keys
      for $value in map:get($counts, $key)
      return $value || " - " || $key
    )[1 to 10]

I tested with 1k docs: my earlier tuples approach took 14 sec with that, and this one takes less than 1 sec.

Cheers,
Geert

From: Johan Mörén <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Sunday, June 28, 2015 at 12:07 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Find the document(s) with max occurrences of an element-attribute reference

Thanks again for looking into this, Geert!

I tried a mix of your approach (minus the -$uris part) and mine and got better results. But that does not give me the ability to sort the whole database based on occurrence. It only gets me the document(s) with the maximum number of occurrences.

I tried this query in production, where we have 1.4 million documents and the total number of file elements is roughly 25 million. I got the result back in about 3 minutes. So it was definitely an improvement, but it will not scale over time.

Thanks for looking down the UDF path. Hopefully this could lead to a more general and useful approach.

Cheers,
Johan

On Sat, Jun 27, 2015 at 8:06 PM Geert Josten <[email protected]> wrote:

My approach was similar, but tried to sum all frequencies per URI. Unfortunately, that approach gets slower with more documents, and more distinct file sizes.
Adding a simple count attribute or element in the file somewhere would greatly simplify the run-time calculation, and that is what I would normally recommend. For the sake of completeness I’ll give it some more thought to see if there are ways to improve on the 3 minutes. A UDF might be useful, would have to try that.

Cheers,
Geert

From: Johan Mörén <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Saturday, June 27, 2015 at 1:23 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Find the document(s) with max occurrences of an element-attribute reference

Hi Christopher,

I tried your approach but still without success. I think the reason might be that your example uses a fixed value for size ("yes"). Since frequency is based on the value, you get the right results.

Regards,
Johan

On Sat, Jun 27, 2015 at 12:34 AM Christopher Hamlin <[email protected]> wrote:

Hi Johan,

Maybe I'm not clear on what you want. I just tried something. I created documents in a database using

    xquery version "1.0-ml";

    for $i in 1 to 100
    let $doc := <doc>{(1 to $i)!<file size='yes'/>}</doc>
    let $uri := '/'||$i||'.xml'
    return xdmp:document-insert($uri, $doc)

so for example

    /1.xml => <doc>
                <file size="yes"/>
              </doc>

and

    /2.xml => <doc>
                <file size="yes"/>
                <file size="yes"/>
              </doc>

and so on.
With a file/@size element-attribute range index, the query

    xquery version '1.0-ml';

    let $uris := cts:uri-reference()
    let $ea := cts:element-attribute-reference(
      xs:QName('file'), xs:QName('size'),
      'collation=http://marklogic.com/collation/codepoint')
    return
      for $tuple in cts:value-tuples(($uris, $ea),
        ('item-frequency', 'frequency-order', 'descending', 'limit=3'))
      return fn:concat($tuple[1], ' -> ', cts:frequency($tuple))

returns

    /100.xml -> 100
    /99.xml -> 99
    /98.xml -> 98
    /97.xml -> 97
    /96.xml -> 96
    /95.xml -> 95
    /94.xml -> 94
    /93.xml -> 93
    /92.xml -> 92
    /91.xml -> 91

Is this close to what you want?

Regards,
Chris

On Fri, Jun 26, 2015 at 12:41 PM, Johan Mörén <[email protected]> wrote:
> Hi Christopher!
>
> I'm not sure where you want me to use these options. I tried adding
> them to the cts:value-tuples() call, but that did not return the expected result.
>
> Like this:
>
> ...
> for $tuple in
>   cts:value-tuples(
>     (
>       cts:uri-reference(),
>       $sizeRef
>     ),
>     ("frequency-order","descending","limit=10")
>   )
> ...
>
> Regards,
> Johan
>
> On Fri, Jun 26, 2015 at 5:58 PM Christopher Hamlin
> <[email protected]>
> wrote:
>>
>> If you just want something like top ten, I think it's more direct
>> possibly.
>>
>> Can you try returning frequency-order, descending, limit=10? Are those
>> options you can use?
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general
