Hi Jim,
Why do you have to update the report document each time you get a hit?
How many documents do you expect to match--10, 1000, 100000? I am
blissfully ignorant of CORB, so maybe (probably...) I am looking at this
naively, but it seems like you might be able to do this all in XQuery.
I am thinking maybe break it up into batches of, say, 1000 documents,
then spawn the update to the report for each 1000 hits. If you can
constrain your cts:uri-match with a cts:query so that you don't have to
look at every document, that would help too. Here is pseudo-code of what
I mean:
let $report-batch :=
  <batch>{
    let $query := some-query-that-all-my-candidate-docs-must-match
    for $x in cts:uri-match("myprefix*", (), $query)[1 to 1000]
    return
      <match>
        <uri>{$x}</uri>
        <report>{fn:doc($x)/foo (: $x is a URI string, so load the doc
          first; /foo is whatever you want to extract for the report :)}</report>
      </match>
  }</batch>
return
  xdmp:spawn("/my-update-module", (xs:QName("batch"), $report-batch))
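The snippet above only spawns the first 1000 hits. To cover every match,
a loop along these lines might work. This is an untested sketch: the
batch size, the "myprefix*" pattern, the /foo path and the module path
are all placeholders, and it assumes the URI lexicon is enabled.

```xquery
xquery version "1.0-ml";

(: Assumed batch size; tune to taste. :)
let $batch-size := 1000
(: All candidate URIs; add a cts:query as the third argument to narrow this. :)
let $uris := cts:uri-match("myprefix*")
let $batch-count := xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
for $i in 1 to $batch-count
let $start := ($i - 1) * $batch-size + 1
let $batch :=
  <batch>{
    for $uri in fn:subsequence($uris, $start, $batch-size)
    return
      <match>
        <uri>{$uri}</uri>
        <report>{fn:doc($uri)/foo (: placeholder path :)}</report>
      </match>
  }</batch>
(: One spawned task per batch, so one report update per 1000 URIs. :)
return xdmp:spawn("/my-update-module", (xs:QName("batch"), $batch))
```

Each spawned task then does a single update to the report, which should
mean far fewer retries than one update per hit.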
Then if you are worried this document will get too big, you can fragment
it on <batch>.
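For what it's worth, the update module itself could be very small. Here
is a rough sketch, assuming the report document already exists. The
report URI is made up for illustration, and "/my-update-module" is just
the placeholder name from my pseudo-code.

```xquery
xquery version "1.0-ml";

(: Hypothetical /my-update-module: appends one <batch> to the report. :)
declare variable $batch as element(batch) external;

(: Assumed report URI; the document must have been created beforehand. :)
declare variable $report-uri as xs:string := "/reports/my-report.xml";

(: Append the whole batch as a single child of the report's root element. :)
xdmp:node-insert-child(fn:doc($report-uri)/*, $batch)
```

Concurrent tasks will still contend for the one report document, but
with one insert per batch rather than one per hit.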
You might be able to take a similar approach with CORB too.
-Danny
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of James A.
Robinson
Sent: Thursday, March 05, 2009 8:22 AM
To: [email protected]
Subject: [MarkLogic Dev General] CORB reporting
Hi folks,
I've got several reports I need to occasionally run across our entire
database, about 3.5 million documents and growing, and I was wondering
if anyone here could share their experiences with handling concurrent
updates to a report document when using CORB w/ multiple threads.
Basically I've got a listing module that does something simple like using
cts:uri-match to grab all $URI with a certain extension, and then a
report which, for each $URI it is run against, determines if the file
is of interest.
If the $URI is of interest, I need to record that fact somewhere.
What I have been doing is:
(1) Outside of CORB, using xdmp:document-insert to create
a new report document
(2) spin up CORB and have it use xdmp:node-insert-child against
the root element of the report document for each $URI of
interest, adding an element() with the necessary data I want
to report on.
I may be misunderstanding the Developer Guide, but I'm under the
impression this should be safe, and that MLS will detect multiple
updates, back out, and try again. This appears to be the behavior that
I observe as well. I do worry about the number of notifications in the
log regarding retries, but it does seem to all work.
Two problems I have with this are
(1) It'd be nice to somehow wrap a synchronized creation of the
report document inside of the main module, so that I don't have
to manually do anything. I can't see how to do this w/o having
something like an existing, dummy, locking document available and
having a contract between authors of reporting modules to use that
as the synchronization point when creating new report documents.
I was curious about the naive if (fn:doc-available(..)) then
xdmp:node-insert-child(...) else xdmp:document-insert(...) approach,
but some tests indicated that was not thread safe, and that records
in the report could go missing.
(2) I worry about the size of the document and that MLS will start to
slow down if we ever have to write a report which has many many
child entries (one early attempt where I was appending text()
nodes instead of element() nodes demonstrated this problem fairly
quickly, I have to imagine MLS was doing something like a per-call
concatenation of the previous text() and the new text() for every
update).
I suppose the obvious answers are to either figure out a way
to split reports across sub-report documents (or have them as
sub-report children of a single report document and configure
MLS to fragment on them), but I was wondering if anyone here had
particular techniques they used which they found worked well?
Jim
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
James A. Robinson [email protected]
Stanford University HighWire Press http://highwire.stanford.edu/
+1 650 7237294 (Work) +1 650 7259335 (Fax)
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general