I got what I needed by creating a simple Groovy script that uses the XCC
library to submit queries. The script is below. My main discovery was that I
needed to create a new session for every iteration to avoid connection
timeouts. With this I was able to process several hundred thousand docs and
accumulate the results on my local machine. My command line is:
groovy -cp lib/xcc.jar GetArticleMetadataDetails.groovy
I chose Groovy because it supports Java libraries directly and makes it easy to
script things.
Groovy script:
#!/usr/bin/env groovy
/*
 * Use the XCC jar to run enrichment jobs and collect the results
 * into a single local file.
 */
import com.marklogic.xcc.*

// Connection details for the XDBC app server.
ContentSource source = ContentSourceFactory.newContentSource("myserver", 1984, "user", "pw")

// Allow each request up to an hour of server time.
RequestOptions options = new RequestOptions()
options.setRequestTimeLimit(3600)

moduleUrl = "rq-metadata-analysis.xqy"
println "Running module ${moduleUrl}..."
println new Date()

File outfile = new File("query-result.xml")
outfile.write "<article-metadata-counts>\n"

(36..56).each { index ->
    // A new session for every iteration avoids connection timeouts.
    Session session = source.newSession()
    ModuleInvoke request = session.newModuleInvoke(moduleUrl)
    println "Group number: ${index}, ${new Date()}"
    request.setNewIntegerVariable("", "groupNum", index)
    request.setNewIntegerVariable("", "length", 10000)
    request.setOptions(options)

    // Stream the result for this group straight into the local file.
    ResultSequence rs = session.submitRequest(request)
    ResultItem item = rs.next()
    InputStream is = item.asInputStream()
    is.eachLine { line ->
        outfile.append line
        outfile.append "\n"
    }
    session.close()
}
outfile.append "</article-metadata-counts>"
println "Done."
// ==== End of script.
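
For completeness, since the invoked module isn't shown above: here is a minimal
sketch of what rq-metadata-analysis.xqy could look like, assuming the Article/p
element names and /Default/ directory from my original question below, with the
external variables matching what the script binds (the start/end arithmetic is
just one way to slice the corpus by group):

xquery version "1.0-ml";

(: Sketch only: Article/p element names and the /Default/ directory are
   placeholders; $groupNum and $length are bound by the Groovy script. :)
declare variable $groupNum as xs:integer external;
declare variable $length   as xs:integer external;

let $start := ($groupNum - 1) * $length + 1
let $end   := $groupNum * $length
for $article in cts:search(/Article,
    cts:directory-query("/Default/", "infinity"))[$start to $end]
return
  <article-counts id="{$article/@id}" paras="{count($article//p)}"/>

Each invocation then returns the article-counts elements for one slice of the
corpus, and the script simply concatenates the slices between the wrapper tags.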
--
Eliot Kimber
http://contrext.com
On 5/22/17, 10:43 PM, "[email protected] on behalf of Eliot Kimber"
<[email protected] on behalf of [email protected]> wrote:
I haven’t yet seen anything in the docs that directly addresses what I’m
trying to do and suspect I’m simply missing some ML basics or just going about
things the wrong way.
I have a corpus of several hundred thousand docs (but could be millions, of
course), where each doc averages about 200K and contains several thousand elements.
I want to analyze the corpus to get details about the number of specific
subelements within each document, e.g.:
for $article in cts:search(/Article, cts:directory-query("/Default/",
"infinity"))[$start to $end]
return <article-counts id="{$article/@id}"
paras="{count($article//p)}"/>
I’m running this as a query from Oxygen (so I can capture the results
locally and do other stuff with them).
On the server I’m using, I blow the expanded tree cache if I try to request
more than about 20,000 docs.
Is there a way to do this kind of processing over an arbitrarily large set
*and* get the results back from a single query request?
I think the only solution is to write the results back to the database
and then fetch them as the last step, but I was hoping there was something
simpler.
Have I missed an obvious solution?
Thanks,
Eliot
--
Eliot Kimber
http://contrext.com
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general