Hi, Eliot:

One alternative to Geert's good suggestion -- if and only if the number of element names is small and you can create range indexes on them:

* add an element attribute range index on Article/@id
* add an element range index on p
* execute a cts:value-tuples() call with the constraining element query and directory query (see the first sketch below)
* iterate over the tuples, incrementing the value of the id in a map
* remove the range index on p

In MarkLogic 9, that approach gets simpler. You can just use TDE to project rows with columns for the id and element, group on the id column, and count the rows in the group (see the second sketch below).
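A rough, untested sketch of the tuples approach -- the element names are from your example, and I'm assuming no namespaces and that both range indexes above are in place:

(: sum per-article p counts from the lexicons, without touching documents :)
let $counts := map:map()
let $tuples :=
  cts:value-tuples(
    (cts:element-attribute-reference(xs:QName("Article"), xs:QName("id")),
     cts:element-reference(xs:QName("p"))),
    "item-frequency",
    cts:and-query((
      cts:directory-query("/Default/", "infinity"),
      cts:element-query(xs:QName("p"), cts:true-query())
    ))
  )
let $_ :=
  for $tuple in $tuples
  let $id := json:array-values($tuple)[1]
  return map:put($counts, $id,
    (map:get($counts, $id), 0)[1] + cts:frequency($tuple))
for $id in map:keys($counts)
return <article-counts id="{$id}" paras="{map:get($counts, $id)}"/>

Note the "item-frequency" option: without it, cts:frequency() counts fragments rather than occurrences, and the per-article sums would be wrong.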
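And for the MarkLogic 9 route, assuming you have inserted a TDE template that projects one row per p element along with its article id into a view (the schema, view, and column names Article, paras, id, and para here are made up), the Optic query might look like:

import module namespace op = "http://marklogic.com/optic"
  at "/MarkLogic/optic.xqy";

(: group the projected rows by article id and count the rows in each group :)
op:from-view("Article", "paras")
  => op:group-by(op:col("id"), op:count("paras", "para"))
  => op:result()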
Hoping that's useful (and salutations in passing),

Erik Hennum

________________________________________
From: [email protected] [[email protected]] on behalf of Geert Josten [[email protected]]
Sent: Tuesday, May 23, 2017 12:53 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Processing Large Number of Docs to Get Statistics

Hi Eliot,

I'd consider using taskbot (http://registry.demo.marklogic.com/package/taskbot) in combination with either $tb:OPTIONS-SYNC or $tb:OPTIONS-SYNC-UPDATE. It will make optimal use of the TaskServer of the host on which you initiate the call. It doesn't scale endlessly, but it batches up the work automatically for you, and will get you a lot further fairly easily. The sketch below shows the batching idea.
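Not taskbot itself, but a minimal, untested sketch of the batching idea it automates -- spawn a task per batch of URIs onto the task server, write each batch's counts back to the database, and collect them afterwards (the batch size and the /counts/ URI scheme are made up):

let $uris := cts:uris((), (), cts:directory-query("/Default/", "infinity"))
let $batch-size := 1000
for $start in (1 to fn:count($uris))[. mod $batch-size eq 1]
let $batch := $uris[$start to $start + $batch-size - 1]
return
  xdmp:spawn-function(function() {
    xdmp:document-insert(
      "/counts/batch-" || $start || ".xml",
      <batch>{
        for $uri in $batch
        let $article := fn:doc($uri)/Article
        return <article-counts id="{$article/@id}"
          paras="{fn:count($article//p)}"/>
      }</batch>)
  }, <options xmlns="xdmp:eval"><update>true</update></options>)

A final request over cts:directory-query("/counts/") then pulls all the results back in a single query, which is essentially the "write back and fetch" fallback Eliot mentions below.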
Cheers,
Geert

On 5/23/17, 5:43 AM, "[email protected] on behalf of Eliot Kimber" <[email protected] on behalf of [email protected]> wrote:

>I haven't yet seen anything in the docs that directly addresses what I'm
>trying to do, and suspect I'm simply missing some ML basics or just going
>about things the wrong way.
>
>I have a corpus of several hundred thousand docs (but it could be millions,
>of course), where each doc averages 200K and several thousand elements.
>
>I want to analyze the corpus to get details about the number of specific
>subelements within each document, e.g.:
>
>for $article in cts:search(/Article,
>    cts:directory-query("/Default/", "infinity"))[$start to $end]
>return <article-counts id="{$article/@id}"
>    paras="{count($article//p)}"/>
>
>I'm running this as a query from Oxygen (so I can capture the results
>locally and do other stuff with them).
>
>On the server I'm using, I blow the expanded tree cache if I try to
>request more than about 20,000 docs.
>
>Is there a way to do this kind of processing over an arbitrarily large
>set *and* get the results back from a single query request?
>
>I think the only solution is to write the results back to the database
>and then fetch that as the last thing, but I was hoping there was
>something simpler.
>
>Have I missed an obvious solution?
>
>Thanks,
>
>Eliot
>
>--
>Eliot Kimber
>http://contrext.com

_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general