I got what I needed by creating a simple Groovy script that uses the XCC
library to submit queries. The script is below. My main discovery was that I
needed to create a new session for every iteration to avoid connection
timeouts. With this I was able to process several hundred thousand docs and
accumulate the results on my local machine. My command line is:
groovy -cp lib/xcc.jar GetArticleMetadataDetails.groovy
I chose Groovy because it supports Java libraries directly and makes it easy to
script things.
Groovy script:
#!/usr/bin/env groovy
/*
 * Use the XCC jar to run enrichment jobs and collect the results
 * into a single local file.
 */
import com.marklogic.xcc.*

// Connection details for the XDBC app server.
ContentSource source = ContentSourceFactory.newContentSource("myserver", 1984, "user", "pw")

// Allow each request up to an hour of server time.
RequestOptions options = new RequestOptions()
options.setRequestTimeLimit(3600)

moduleUrl = "rq-metadata-analysis.xqy"
println "Running module ${moduleUrl}..."
println new Date()

File outfile = new File("query-result.xml")
outfile.write "<article-metadata-counts>\n"

(36..56).each { index ->
    // A new session for every iteration avoids connection timeouts.
    Session session = source.newSession()
    ModuleInvoke request = session.newModuleInvoke(moduleUrl)
    println "Group number: ${index}, ${new Date()}"
    request.setNewIntegerVariable("", "groupNum", index)
    request.setNewIntegerVariable("", "length", 10000)
    request.setOptions(options)

    // Stream the result for this group straight into the local file.
    ResultSequence rs = session.submitRequest(request)
    ResultItem item = rs.next()
    InputStream is = item.asInputStream()
    is.eachLine { line ->
        outfile.append line
        outfile.append "\n"
    }
    session.close()
}
outfile.append "</article-metadata-counts>"
println "Done."
// ==== End of script.
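
For completeness, since the invoked module isn't shown above: here is a minimal
sketch of what rq-metadata-analysis.xqy could look like, assuming the Article/p
element names and /Default/ directory from my original question below, with the
external variables matching what the script binds (the start/end arithmetic is
just one way to slice the corpus by group):

xquery version "1.0-ml";

(: Sketch only: Article/p element names and the /Default/ directory are
   placeholders; $groupNum and $length are bound by the Groovy script. :)
declare variable $groupNum as xs:integer external;
declare variable $length   as xs:integer external;

let $start := ($groupNum - 1) * $length + 1
let $end   := $groupNum * $length
for $article in cts:search(/Article,
    cts:directory-query("/Default/", "infinity"))[$start to $end]
return
  <article-counts id="{$article/@id}" paras="{count($article//p)}"/>

Each invocation then returns the article-counts elements for one slice of the
corpus, and the script simply concatenates the slices between the wrapper tags.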
--
Eliot Kimber
http://contrext.com
On 5/22/17, 10:43 PM, "[email protected] on behalf of Eliot Kimber"
<[email protected] on behalf of [email protected]> wrote:
I haven’t yet seen anything in the docs that directly addresses what I’m
trying to do and suspect I’m simply missing some ML basics or just going about
things the wrong way.
I have a corpus of several hundred thousand docs (but could be millions, of
course), where each doc averages about 200K and contains several thousand elements.
I want to analyze the corpus to get details about the number of specific
subelements within each document, e.g.:
for $article in cts:search(/Article, cts:directory-query("/Default/",
"infinity"))[$start to $end]
return <article-counts id="{$article/@id}"
paras="{count($article//p)}"/>
I’m running this as a query from Oxygen (so I can capture the results
locally and do other stuff with them).
On the server I’m using, I blow the expanded tree cache if I try to request
more than about 20,000 docs.
Is there a way to do this kind of processing over an arbitrarily large set
*and* get the results back from a single query request?
I think the only solution is to write the results back to the database
and then fetch them as the last step, but I was hoping there was something
simpler.
Have I missed an obvious solution?
Thanks,
Eliot
--
Eliot Kimber
http://contrext.com
_______________________________________________
General mailing list
[email protected]
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general