Hi Jim,

Why do you have to update the report document each time you get a hit?
How many documents do you expect to match--10, 1000, 100000?  I am
blissfully ignorant of CORB, so maybe (probably...) I am looking at this
naively, but it seems like you might be able to do this all in XQuery.
I am thinking maybe break it up into batches of, say, 1000 documents,
then spawn the update to the report for each 1000 hits.  If you can
constrain your cts:uri-match with a cts:query that could make it so you
don't have to look at every document, that would help too.  Here is
pseudo-code of what I mean:

let $report-batch :=
<batch>{
 let $query := some-query-that-all-my-candidate-docs-must-match
 for $x in cts:uri-match("myprefix*", (), $query)[1 to 1000]
 return
 <match>
   <uri>{$x}</uri>
   <report>{doc($x)/foo (: whatever you want to extract
         from the doc for the report :)}</report>
 </match>
}</batch>
return
xdmp:spawn("/my-update-module", ((xs:QName("batch"), $report-batch)))

Then if you are worried this document will get too big, you can fragment
it on <batch>.
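
For what it's worth, the spawned update module could be as small as the
sketch below.  This is untested and rests on a few assumptions of mine:
the report URI and the <report> root element name are made up, and the
report document is assumed to exist already (created up front with
xdmp:document-insert, as in your step 1), so the module never has to
decide between creating it and appending to it.

```xquery
xquery version "1.0-ml";

(: Sketch of the module spawned as "/my-update-module".
   $batch is the external variable passed in by xdmp:spawn. :)
declare variable $batch as element(batch) external;

(: Hypothetical report URI; assumed to exist already with a
   <report> root element. :)
declare variable $report-uri as xs:string := "/reports/my-report.xml";

(: One append per batch of up to 1000 matches, instead of one
   update per matching document. :)
xdmp:node-insert-child(doc($report-uri)/report, $batch)
```

The point is just that the report document gets touched once per batch
rather than once per hit, which should cut down on the lock contention
and retries.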

You might be able to take a similar approach with CORB too.

-Danny

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of James A.
Robinson
Sent: Thursday, March 05, 2009 8:22 AM
To: [email protected]
Subject: [MarkLogic Dev General] CORB reporting


Hi folks,

I've got several reports I need to occasionally run across our entire
database, about 3.5 million documents and growing, and I was wondering
if anyone here could share their experiences with handling concurrent
updates to a report document when using CORB w/ multiple threads.

Basically I've got a listing module that does something simple like using
cts:uri-match to grab all $URI with a certain extension, and then a
report which, for each $URI it is run against, determines if the file
is of interest.

If the $URI is of interest, I need to record that fact somewhere.
What I have been doing is:

  (1) Outside of CORB, using xdmp:document-insert to create
      a new report document

  (2) Spinning up CORB and having it use xdmp:node-insert-child against
      the root element of the report document for each $URI of
      interest, adding an element() with the necessary data I want
      to report on.

I may be misunderstanding the Developer Guide, but I'm under the
impression this should be safe, and that MLS will detect multiple
updates, back out, and try again.  This appears to be the behavior that
I observe as well.  I do worry about the number of notifications in the
log regarding retries, but it does seem to all work.

Two problems I have with this are

  (1) It'd be nice to somehow wrap a synchronized creation of the
      report document inside of the main module, so that I don't have
      to manually do anything.  I can't see how to do this w/o having
      something like an existing, dummy, locking document available and
      having a contract between authors of reporting modules to use that
      as the synchronization point when creating new report documents.

      I was curious about the naive if (doc-available(...)) then
      xdmp:node-insert-child(...) else xdmp:document-insert(...)
      approach, but some tests indicated that it was not thread safe,
      and that records in the report could go missing.

  (2) I worry about the size of the document and that MLS will start to
      slow down if we ever have to write a report which has many many
      child entries (one early attempt where I was appending text()
      nodes instead of element() nodes demonstrated this problem fairly
      quickly, I have to imagine MLS was doing something like a per-call
      concatenation of the previous text() and the new text() for every
      update).

      I suppose the obvious answers are to either figure out a way
      to split reports across sub-report documents (or have them as
      sub-report children of a single report document and configure
      MLS to fragment on them), but I was wondering if anyone here had
      particular techniques they used which they found worked well?


Jim

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
James A. Robinson                       [email protected]
Stanford University HighWire Press      http://highwire.stanford.edu/
+1 650 7237294 (Work)                   +1 650 7259335 (Fax)
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general