I haven’t yet seen anything in the docs that directly addresses what I’m trying 
to do, and I suspect I’m simply missing some ML basics or just going about 
things the wrong way.

I have a corpus of several hundred thousand docs (but it could be millions, of 
course), where each doc averages around 200K in size and contains several 
thousand elements.

I want to analyze the corpus to get counts of specific subelements within each 
document, e.g.:


for $article in cts:search(/Article,
    cts:directory-query("/Default/", "infinity"))[$start to $end]
return <article-counts id="{$article/@id}" paras="{count($article//p)}"/>

I’m running this as a query from Oxygen (so I can capture the results locally 
and do other stuff with them).

On the server I’m using, I blow the expanded tree cache if I try to request 
more than about 20,000 docs.
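
(For reference, the $start/$end window in the query above is just a per-request 
slice that I bump on each request from Oxygen; a minimal sketch of the bindings, 
assuming plain external variables:)

declare variable $start as xs:integer external;
declare variable $end   as xs:integer external;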

Is there a way to do this kind of processing over an arbitrarily large set 
*and* get the results back from a single query request?

I think the only solution is to write the results back to the database and 
then fetch them as the last thing, but I was hoping there was something simpler.
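
In case it helps to be concrete, here’s roughly the shape I have in mind (a 
sketch only: the module path /count-batch.xqy, the /results/ directory, and the 
batch size are made-up, and it assumes running the batches on the task server 
via xdmp:spawn is acceptable):

xquery version "1.0-ml";
(: Dispatcher: spawn one task per batch of articles. :)
let $total := xdmp:estimate(
  cts:search(/Article, cts:directory-query("/Default/", "infinity")))
let $batch-size := 10000
for $i in 0 to ($total - 1) idiv $batch-size
let $start := $i * $batch-size + 1
return
  xdmp:spawn("/count-batch.xqy",
    (xs:QName("start"), $start,
     xs:QName("end"),   $start + $batch-size - 1))

xquery version "1.0-ml";
(: count-batch.xqy (made-up name): compute the counts for one window
   and write them back as a single results document. :)
declare variable $start as xs:integer external;
declare variable $end   as xs:integer external;

let $counts :=
  for $article in cts:search(/Article,
      cts:directory-query("/Default/", "infinity"))[$start to $end]
  return <article-counts id="{$article/@id}" paras="{count($article//p)}"/>
return
  xdmp:document-insert(
    fn:concat("/results/article-counts-", $start, ".xml"),
    <batch>{$counts}</batch>)

Then, once the task server has drained, the last step is just to pull the saved 
counts back out:

<all-counts>{
  cts:search(/batch, cts:directory-query("/results/"))//article-counts
}</all-counts>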

Have I missed an obvious solution?

Thanks,

Eliot

--
Eliot Kimber
http://contrext.com
 


