RE: [MarkLogic Dev General] CORB: Sleep during configurable hours and process 1 forest at a time

Hartwig, Brent (CL Tech Sv) Wed, 24 Sep 2008 13:49:58 -0700

The insight is invaluable. Below is our modified approach:

1. Use Michael's XQuery for selecting URIs that are sorted by forest.


2. Modify Corb to process x URIs before resting for y seconds. I believe Corb 
already handles the backup state.

3. Limit the size of automated merges; manually merge during off hours.
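
A minimal sketch of the "process x URIs, then rest for y seconds" loop from step 2. This is not Corb's actual code: the function name, its parameters, and the injectable `sleep` argument are hypothetical, and `process` stands in for whatever work is done per URI.

```python
import time
from typing import Callable, Iterable


def process_in_bursts(
    uris: Iterable[str],
    process: Callable[[str], None],
    burst_size: int,
    rest_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process URIs in bursts of burst_size, resting between bursts.

    Returns the number of URIs processed. All names here are hypothetical.
    """
    if burst_size < 1:
        raise ValueError("burst_size must be at least 1")
    done = 0
    for uri in uris:
        # Rest before starting a new burst, but never before the first one
        # and never after the final URI.
        if done and done % burst_size == 0:
            sleep(rest_seconds)
        process(uri)
        done += 1
    return done
```

Passing `sleep` as an argument keeps the loop easy to dry-run without actually waiting.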

This will be quicker to pull together. My thanks to all who chimed in. I intend 
to post the outcome, though I am not sure when.

-Brent

________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ian Small
Sent: Wednesday, September 24, 2008 1:59 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] CORB: Sleep during configurable hours 
and process 1 forest at a time

Oops, that should be "fewer", not "newer":
And you'll need *fewer* transactions to fill up your in-memory stand before it 
gets flushed.  But the flush of that stand to disk will be as efficient as it 
was previously.

ian

________________________________
From: Ian Small
Sent: Wednesday, September 24, 2008 10:56 AM
To: 'General Mark Logic Developer Discussion'
Subject: RE: [MarkLogic Dev General] CORB: Sleep during configurable hours 
and process 1 forest at a time

Ah, excellent question.

When you're updating a document in the database, the following things need to 
happen:

1) read the document off disk
2) update the appropriate nodes in the document, which:
    a) writes journal entries (for recovery purposes, if needed) to disk
    b) appends fragments (or if you have no fragmentation policy, a document) 
to an in-memory stand
    c) marks the old fragments (or document) as obsolete
3) only when an in-memory stand is full (or write activity appears to quiesce) 
will the server flush the in-memory stand to disk in a highly-efficient 
sequential write operation

So:
Obviously, the larger the document, the more I/O the read will consume.
And if you're working with large unfragmented documents, the journal entries 
may be bigger.  But the good news is that writing to the journal is pretty 
efficient (as you would expect).
And you'll need newer transactions to fill up your in-memory stand before it 
gets flushed.  But the flush of that stand to disk will be as efficient as it 
was previously.
Marking old fragments or documents obsolete is noise for the purposes of this 
discussion.
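
The point about document size and the in-memory stand can be put into rough arithmetic. The sketch below is only an illustration: the stand capacity is a hypothetical figure (not a MarkLogic default), and it ignores index data and other overhead that a real stand also holds.

```python
import math

KB = 1024
MB = 1024 * KB


def updates_to_fill_stand(stand_capacity_bytes: int, avg_doc_bytes: int) -> int:
    """Rough count of document updates needed to fill an in-memory stand.

    Hypothetical model: every update appends exactly avg_doc_bytes.
    """
    if stand_capacity_bytes <= 0 or avg_doc_bytes <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(stand_capacity_bytes / avg_doc_bytes))


# Assumed 64 MB capacity, chosen only for illustration.
small_docs = updates_to_fill_stand(64 * MB, 100 * KB)
large_docs = updates_to_fill_stand(64 * MB, 20 * MB)
```

Under these assumed numbers, a few large documents fill the stand that would otherwise absorb several hundred small ones, which is why a smaller burst makes sense for large documents.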

So if I were in your position, I might do fewer transactions in each burst 
before "resting".

I think the last time we rolled our content set a few weeks back, we did 100 
then slept for a second, but our docs tend to be under 100 KB.  Our servers 
handled this without blinking.  It wasn't that 100 was a number tuned to the 
maximum potential.  What we did was model this rate and show that it converged 
on completion fast enough for our purposes, so we went with it because our 
run-time performance degraded minimally at this level (our goal is to serve 
message pages in less than 50ms, way less when possible).  In your case, you 
might want to experiment with some lower numbers, depending mostly on your I/O 
system's ability to sustain throughput.  (It wouldn't surprise me if your I/O 
system was more capable than the one we're using...)
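
One way to model a burst-and-sleep rate before committing to it is a simple duration estimate. This is a sketch under stated assumptions, not the model used for markmail.org: it assumes a constant per-document processing time and a fixed pause after every burst except the last, and all names and sample numbers are hypothetical.

```python
import math


def estimate_run_seconds(
    total_docs: int,
    burst_size: int,
    rest_seconds: float,
    seconds_per_doc: float,
) -> float:
    """Estimate wall-clock seconds for a burst-and-rest reprocessing run."""
    if total_docs < 0 or burst_size < 1:
        raise ValueError("total_docs must be >= 0 and burst_size >= 1")
    bursts = math.ceil(total_docs / burst_size)
    rests = max(bursts - 1, 0)  # no pause is needed after the final burst
    return total_docs * seconds_per_doc + rests * rest_seconds


# Hypothetical inputs: 1,000 docs, bursts of 100, 1 s pause, 0.5 s per doc.
estimate = estimate_run_seconds(1000, 100, 1.0, 0.5)
```

Trying a few values of `burst_size` and `rest_seconds` shows whether a gentler rate still finishes within the time available.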

ian

________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Hartwig, Brent 
(CL Tech Sv)
Sent: Wednesday, September 24, 2008 4:37 AM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] CORB: Sleep during configurable hours 
and process 1 forest at a time

Hi, Ian,

Quite a lot to chew on - thank you! I understand your process is able to run 
continuously yet, to keep the site running smoothly, the process takes short 
breaks and imposes a size limit on the merges. That size limit requires you to 
initiate the residual merge during a low usage period.

Do you believe the size of the documents being updated would impact this 
approach? Our files can be quite large. It is common for folders to include 
multiple 10 - 20 MB files. We do have files approaching the 300 MB limit.

-Brent

________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ian Small
Sent: Tuesday, September 23, 2008 2:52 PM
To: General Mark Logic Developer Discussion
Subject: RE: [MarkLogic Dev General] CORB: Sleep during configurable hours 
and process 1 forest at a time

hi -

while we don't use corb to do it, we do in fact do large-scale in-place 
modifications of the markmail.org production content set.  we take a similar 
approach to yours:
- only work on one forest (in our case, per D node) at once
- we manage the concurrency of the work to make sure there are lots of cores 
available for user queries
- we pause in between small bursts of reprocessing
- we manage monster merges manually so that they happen during our low usage 
time (we have global users, so this is between about 6pm and 2 am pacific)

we do all this because, like you, we are working around live load on the server 
and need to maintain response time while all this is going on

some things to keep note of:
- pausing between every operation can backfire - because if the pause is long 
enough, it can "trick" the server into thinking that there are no more updates 
coming, which can cause an in-memory stand to be flushed out to disk.  the 
result of this is that a bunch of really small in-memory stands can get shot 
out to disk, requiring more merges - although those merges will be incredibly 
fast and incredibly lightweight.  so we tend to keep our pauses short: just 
long enough to give other processing some time to get through.  you may want 
to experiment on this front a little bit.
- we NEVER turn off merges - this is essentially playing russian roulette, and 
committing to pull the trigger 12 times while waiting to see what happens.  
what we do is limit the large merges (where large is compared to our forest 
size).  in our case, with 200 GB forests, we might stick our limit at 75-100 
GB, for instance.  that generally leaves us with a forest with 2-4 stands in 
it, which we can then merge manually in low times.
- we start the manual "all done" merges using the forest admin pages
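
The merge-limit figures above can be checked with back-of-the-envelope arithmetic. The sketch assumes only that no stand grows past the configured limit, so it gives a lower bound on the stand count, not a prediction; the function name is hypothetical.

```python
import math


def min_stands_after_limited_merges(forest_gb: float, merge_limit_gb: float) -> int:
    """Lower bound on stands left in a forest when no stand exceeds the limit."""
    if forest_gb <= 0 or merge_limit_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(forest_gb / merge_limit_gb))


at_100 = min_stands_after_limited_merges(200, 100)
at_75 = min_stands_after_limited_merges(200, 75)
```

For a 200 GB forest, limits of 100 GB and 75 GB give lower bounds that sit inside the 2-4 stands reported above.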

in general, we take this approach because we plan our reprocessing sufficiently 
far in advance that we can have it take days, sometimes even 10 days.  we 
haven't had to rework the content set in a crash program, so i can't share 
any real-world experience there.

ian


________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Hartwig, Brent 
(CL Tech Sv)
Sent: Tuesday, September 23, 2008 8:03 AM
To: [email protected]
Subject: [MarkLogic Dev General] CORB: Sleep during configurable hours 
and process 1 forest at a time

Hello,

Has anyone extended Corb to sleep during configurable periods or process one 
forest at a time?

We need to modify every object in our ML instance. Multiple merges are 
saturating the IO channel. To keep production stable and usable, we intend to 
put the job to sleep during peak hours and only process one forest at a time. 
Each processed URI will go into a collection, allowing us to verify all are 
processed. Preliminary approaches are described below. Your thoughts and 
experience are welcome. Thank you in advance.
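
The collection-based verification can be sketched as a set difference. How the two URI lists are fetched from the database is left out; the function name and its inputs are hypothetical.

```python
from typing import Iterable, List


def unprocessed_uris(all_uris: Iterable[str], processed_uris: Iterable[str]) -> List[str]:
    """Return the URIs that are in the database but not yet in the
    'processed' collection, sorted for stable reporting."""
    processed = set(processed_uris)
    return sorted(uri for uri in set(all_uris) if uri not in processed)
```

An empty result would indicate that every URI has been processed.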

Sleep: Nothing too concerning here (but tried & true is always better). We're 
planning to work around backups and peak hours, and to allow time for system 
resources to recover before peak hours resume.
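
A minimal sketch of a configurable sleep schedule, assuming quiet periods are given as (start, end) times of day and may wrap past midnight. The function names and the sample windows in the usage note are hypothetical, not an existing Corb option.

```python
from datetime import time
from typing import Iterable, Tuple

Window = Tuple[time, time]


def in_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end), allowing windows that wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def may_run(now: time, quiet_windows: Iterable[Window]) -> bool:
    """True if the job may run, i.e. now is outside every quiet window."""
    return not any(in_window(now, start, end) for start, end in quiet_windows)
```

For example, with an assumed peak window of 08:00 to 18:00 and an assumed backup window of 23:30 to 00:30, `may_run(time(12, 0), windows)` is false and `may_run(time(7, 0), windows)` is true.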

Forest: Corb can obtain a list of forests from the specified database via 
Session.getContentbaseMetaData().getForestIds() and iterate in serial. The 
queue would be populated once per forest by substituting the forest ID within 
the provided URIS-MODULE. The initial implementation may impose some usage 
constraints.
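
The serial, one-forest-at-a-time flow can be sketched as below. Plain callables stand in for the XCC metadata call and the URIS-MODULE; no real Corb or XCC API is used, and every name in the block is hypothetical.

```python
from typing import Callable, Dict, Iterable


def process_forests_serially(
    forest_ids: Iterable[str],
    select_uris: Callable[[str], Iterable[str]],
    process_uri: Callable[[str], None],
) -> Dict[str, int]:
    """Process one forest at a time; return the URI count handled per forest."""
    counts: Dict[str, int] = {}
    for forest_id in forest_ids:
        # The queue is populated once per forest, from that forest's ID alone.
        queue = list(select_uris(forest_id))
        for uri in queue:
            process_uri(uri)
        counts[forest_id] = len(queue)
    return counts
```

Because the next forest's queue is only built after the current one drains, update activity stays confined to a single forest at any moment.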

-Brent

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
