[MarkLogic Dev General] CORB reporting

James A. Robinson Thu, 05 Mar 2009 08:22:34 -0800

Hi folks,

I've got several reports I need to occasionally run across our entire
database, about 3.5 million documents and growing, and I was wondering
if anyone here could share their experiences with handling concurrent
updates to a report document when using CORB w/ multiple threads.


Basically I've got a listing module that does something simple like using
cts:uri-match to grab all $URI with a certain extension, and then a
report which, for each $URI it is run against, determines if the file
is of interest.
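
To make that concrete, a listing module along these lines might look
like the sketch below.  It is not the exact module: the "*.xml" pattern
is a placeholder, cts:uri-match needs the URI lexicon enabled on the
database, and I'm assuming CORB's convention that the listing module
returns the count first, followed by the URIs themselves.

```xquery
xquery version "1.0-ml";

(: Sketch of a CORB listing (URIs) module.
   ASSUMPTION: "*.xml" is a placeholder for the extension of interest.
   ASSUMPTION: CORB expects the item count first, then the URIs. :)
let $uris := cts:uri-match("*.xml")
return (fn:count($uris), $uris)
```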

If the $URI is of interest, I need to record that fact somewhere.
What I have been doing is:

  (1) Outside of CORB, using xdmp:document-insert to create
      a new report document

  (2) Spinning up CORB and having it use xdmp:node-insert-child against
      the root element of the report document for each $URI of
      interest, adding an element() with the necessary data I want
      to report on.
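
The two steps above can be sketched roughly as follows.  The report
URI "/reports/example.xml", the <report> and <entry> element names, and
the local:of-interest test are all hypothetical stand-ins for
illustration; only $URI, supplied by CORB as an external variable, and
the xdmp: calls come from the actual setup.

```xquery
xquery version "1.0-ml";

(: Step (1), run once outside CORB (hypothetical URI and root element):
     xdmp:document-insert("/reports/example.xml", <report/>)
   Step (2) is this module, which CORB runs once per URI. :)

(: CORB supplies the URI being processed as an external variable. :)
declare variable $URI as xs:string external;

(: ASSUMPTION: location of the report document created in step (1). :)
declare variable $REPORT-URI as xs:string := "/reports/example.xml";

(: ASSUMPTION: placeholder test for "is this file of interest?". :)
declare function local:of-interest($doc as document-node()?) as xs:boolean {
  fn:exists($doc//*[fn:local-name() eq "flagged"])
};

if (local:of-interest(fn:doc($URI)))
then
  (: Append one element per interesting URI to the report's root.
     This raises an error if the report document does not exist yet. :)
  xdmp:node-insert-child(
    fn:doc($REPORT-URI)/report,
    <entry uri="{$URI}"/>)
else ()
```

Every thread updates the same document here, which is where the
contention comes from.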

I may be misunderstanding the Developer Guide, but I'm under the
impression this should be safe, and that MLS will detect multiple
updates, back out, and try again.  This appears to be the behavior that
I observe as well.  I do worry about the number of notifications in the
log regarding retries, but it does seem to all work.

Two problems I have with this are:

  (1) It'd be nice to somehow wrap a synchronized creation of the
      report document inside of the main module, so that I don't have
      to manually do anything.  I can't see how to do this w/o having
      something like an existing, dummy, locking document available and
      having a contract between authors of reporting modules to use that
      as the synchronization point when creating new report documents.

      I was curious about the naive if (fn:doc-available(...)) then
      xdmp:node-insert-child(...) else xdmp:document-insert(...)
      approach, but some tests indicated that it was not thread safe,
      and that records in the report could go missing.

  (2) I worry about the size of the document, and that MLS will start to
      slow down if we ever have to write a report which has a great many
      child entries.  One early attempt, where I was appending text()
      nodes instead of element() nodes, demonstrated this problem fairly
      quickly; I have to imagine MLS was doing something like a per-call
      concatenation of the previous text() and the new text() for every
      update.

      I suppose the obvious answers are either to split reports across
      sub-report documents, or to keep them as sub-report children of a
      single report document and configure MLS to fragment on them, but
      I was wondering if anyone here had particular techniques which
      they found worked well?
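
As a rough illustration of the "split across sub-report documents"
idea, one possibility is to pick a sub-report by hashing $URI.  This is
an untested sketch: the part count of 16, the "/reports/example/part-N.xml"
naming, and the <report>/<entry> elements are hypothetical, and it
assumes every part document was created ahead of time outside CORB
(as in step (1) of my current approach), so the creation race from
problem (1) is not addressed here.  The "of interest" check is left
out for brevity.

```xquery
xquery version "1.0-ml";

(: CORB supplies the URI being processed as an external variable. :)
declare variable $URI as xs:string external;

(: ASSUMPTION: number of sub-report documents, pre-created outside CORB
   as /reports/example/part-0.xml .. part-15.xml, each holding an
   empty <report/> root element. :)
declare variable $PARTS as xs:integer := 16;

(: Hash the URI so concurrent threads tend to land on different
   sub-reports rather than all contending for one document. :)
let $part := xdmp:hash64($URI) mod $PARTS
let $part-uri := fn:concat("/reports/example/part-", $part, ".xml")
return
  xdmp:node-insert-child(
    fn:doc($part-uri)/report,
    <entry uri="{$URI}"/>)
```

This keeps each document smaller and should spread out the update
retries, at the cost of having to merge the parts when reading the
report.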


Jim

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
James A. Robinson                       [email protected]
Stanford University HighWire Press      http://highwire.stanford.edu/
+1 650 7237294 (Work)                   +1 650 7259335 (Fax)
