Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number of element names is small and you can create range indexes on them:

* add an element attribute range index on Article/@id
* add an element range index on p
* execute a cts:value-tuples() call with the constraining element query and directory query (see the first sketch below)
* iterate over the tuples, incrementing the value of the id in a map
* remove the range index on p

In MarkLogic 9, that approach gets simpler. You can just use TDE to project rows with columns for the id and element, group on the id column, and count the rows in the group (see the second sketch below).
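A rough, untested sketch of the tuples approach -- the element names are from your example, and I'm assuming no namespaces and that both range indexes above are in place:

(: sum per-article p counts from the lexicons, without touching documents :)
let $counts := map:map()
let $tuples :=
  cts:value-tuples(
    (cts:element-attribute-reference(xs:QName("Article"), xs:QName("id")),
     cts:element-reference(xs:QName("p"))),
    "item-frequency",
    cts:and-query((
      cts:directory-query("/Default/", "infinity"),
      cts:element-query(xs:QName("p"), cts:true-query())
    ))
  )
let $_ :=
  for $tuple in $tuples
  let $id := json:array-values($tuple)[1]
  return map:put($counts, $id,
    (map:get($counts, $id), 0)[1] + cts:frequency($tuple))
for $id in map:keys($counts)
return <article-counts id="{$id}" paras="{map:get($counts, $id)}"/>

Note the "item-frequency" option: without it, cts:frequency() counts fragments rather than occurrences, and the per-article sums would be wrong.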
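And for the MarkLogic 9 route, assuming you have inserted a TDE template that projects one row per p element along with its article id into a view (the schema, view, and column names Article, paras, id, and para here are made up), the Optic query might look like:

import module namespace op = "http://marklogic.com/optic"
  at "/MarkLogic/optic.xqy";

(: group the projected rows by article id and count the rows in each group :)
op:from-view("Article", "paras")
  => op:group-by(op:col("id"), op:count("paras", "para"))
  => op:result()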
Hoping that's useful (and salutations in passing),

Erik Hennum

________________________________________
From: [email protected] [[email protected]] on behalf of Geert Josten [[email protected]]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi Eliot,

I'd consider using taskbot (http://registry.demo.marklogic.com/package/taskbot) in combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It will make optimal use of the TaskServer of the host on which you initiate the call. It doesn't scale endlessly, but it batches up the work automatically for you, and will get you a lot further fairly easily. The sketch below shows the batching idea.
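Not taskbot itself, but a minimal, untested sketch of the batching idea it automates -- spawn a task per batch of URIs onto the task server, write each batch's counts back to the database, and collect them afterwards (the batch size and the /counts/ URI scheme are made up):

let $uris := cts:uris((), (), cts:directory-query("/Default/", "infinity"))
let $batch-size := 1000
for $start in (1 to fn:count($uris))[. mod $batch-size eq 1]
let $batch := $uris[$start to $start + $batch-size - 1]
return
  xdmp:spawn-function(function() {
    xdmp:document-insert(
      "/counts/batch-" || $start || ".xml",
      <batch>{
        for $uri in $batch
        let $article := fn:doc($uri)/Article
        return <article-counts id="{$article/@id}"
          paras="{fn:count($article//p)}"/>
      }</batch>)
  }, <options xmlns="xdmp:eval"><update>true</update></options>)

A final request over cts:directory-query("/counts/") then pulls all the results back in a single query, which is essentially the "write back and fetch" fallback Eliot mentions below.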
Cheers,
Geert

On 5/23/17, 5:43 AM, "[email protected] on behalf of Eliot Kimber" <[email protected] on behalf of [email protected]> wrote:

>I haven't yet seen anything in the docs that directly addresses what I'm
>trying to do, and suspect I'm simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but it could be millions,
>of course), where each doc averages 200K and several thousand elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article,
>    cts:directory-query("/Default/", "infinity"))[$start to $end]
>return <article-counts id="{$article/@id}"
>    paras="{count($article//p)}"/>
>
>I'm running this as a query from Oxygen (so I can capture the results
>locally and do other stuff with them).
>
>On the server I'm using, I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general