Ah, excellent question.
 
When you're updating a document in the database, the following things
need to happen:
 
1) read the document off disk
2) update the appropriate nodes in the document, which:
    a) writes journal entries (for recovery purposes, if needed) to disk
    b) appends fragments (or if you have no fragmentation policy, a
document) to an in-memory stand
    c) marks the old fragments (or document) as obsolete
3) only when an in-memory stand is full (or write activity appears to
quiesce) will the server flush the in-memory stand to disk in a
highly-efficient sequential write operation
 
So:
Obviously, the larger the document, the more I/O the read will consume.
And if you're working with large unfragmented documents, the journal
entries may be bigger.  But the good news is that writing to the journal
is pretty efficient (as you would expect).
And with larger documents, you'll need fewer transactions to fill up
your in-memory stand before it gets flushed.  But the flush of that
stand to disk will be as efficient as it was previously.
Marking old fragments or documents obsolete is noise for the purposes of
this discussion.
 
So if I were in your position, I might do fewer transactions in each
burst before "resting".
 
I think the last time we rolled our content set a few weeks back, we did
100 then slept for a second, but our docs tend to be sub 100K.  Our
servers handled this without blinking.  It wasn't that 100 was a number
tuned up to its maximum potential.  What we did was model this
rate and show that it converged on completion fast enough for our
purposes, so we went with it because our run-time performance degraded
minimally at this level (our goal is to serve message pages in less than
50ms, way less when possible).  In your case, you might want to
experiment with some lower numbers, depending mostly on your I/O
system's ability to sustain throughput.  (It wouldn't surprise me if
your I/O system was more capable than the one we're using...)
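For what it's worth, the burst-and-sleep pacing described above can be sketched roughly like this. Everything here is a placeholder: process_uri stands in for whatever call performs your actual document update, and batch_size / pause_seconds are the knobs to tune against your own I/O system's sustained throughput.

```python
import time

def run_in_bursts(uris, process_uri, batch_size=100, pause_seconds=1.0):
    """Process URIs in bursts, sleeping between bursts.

    process_uri is a hypothetical stand-in for the per-document update;
    batch_size and pause_seconds are the tuning knobs (we used 100 and
    1 second, but smaller bursts may suit slower I/O systems).
    """
    for i in range(0, len(uris), batch_size):
        # Run one burst of updates back to back.
        for uri in uris[i:i + batch_size]:
            process_uri(uri)
        # Rest before the next burst (skip the sleep after the last one).
        if i + batch_size < len(uris):
            time.sleep(pause_seconds)
```

The point of modeling the rate first is to confirm the job converges fast enough while keeping run-time query latency within budget.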
 
ian


________________________________

        From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Hartwig,
Brent (CL Tech Sv)
        Sent: Wednesday, September 24, 2008 4:37 AM
        To: General Mark Logic Developer Discussion
        Subject: RE: [MarkLogic Dev General] CORB: Sleep during
configurable hours and process 1 forest at a time
        
        

        Hi, Ian,

         

        Hi, Ian,

         

        Quite a lot to chew on - thank you! I understand your process is
able to run continuously, yet to keep the site running smoothly it takes
short breaks and imposes a size limit on the merges. That size limit
requires you to initiate the residual merge during a low-usage period.

         

        Do you believe the size of the documents being updated would
impact this approach? Our files can be quite large. It is common for
folders to include multiple 10 - 20 MB files. We do have files
approaching the 300 MB limit.

         

        -Brent

         

        
________________________________


        From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Ian Small
        Sent: Tuesday, September 23, 2008 2:52 PM
        To: General Mark Logic Developer Discussion
        Subject: RE: [MarkLogic Dev General] CORB: Sleep during
configurable hours and process 1 forest at a time

         

        hi -

         

        while we don't use corb to do it, we do in fact do large-scale
in-place modifications of the markmail.org production content set.  we
take a similar approach to yours:

        - only work on one forest (in our case, per D node) at once

        - we manage the concurrency of the work to make sure there are
lots of cores available for user queries

        - we pause in between small bursts of reprocessing

        - we manage monster merges manually so that they happen during
our low usage time (we have global users, so this is between about 6pm
and 2 am pacific)

         

        we do all this because, like you, we are working around live
load on the server and need to maintain response time while all this is
going on

         

        some things to keep note of:

        - pausing between every operation can backfire - because if the
pause is long enough, it can "trick" the server into thinking that there
are no more updates coming, which can cause an in-memory stand to be
flushed out to disk.  the result of this is that a bunch of really small
in-memory stands can get shot out to disk, requiring more merges -
although those merges will be incredibly fast and incredibly
lightweight.  so we tend to keep our pauses short enough to make sure we
give other processing some time to get through.  so you may want to
experiment on this front a little bit.

        - we NEVER turn off merges - this is essentially playing russian
roulette, and committing to pull the trigger 12 times while waiting to
see what happens.  what we do is limit the large merges (where large is
compared to our forest size).  in our case, with 200 GB forests, we
might stick our limit at 75-100 GB, for instance.  that generally leaves
us with a forest with 2-4 stands in it, which we can then merge manually
in low times.

        - we start the manual "all done" merges using the forest admin
pages

         

        in general, we take this approach because we plan our
reprocessing sufficiently far in advance that we can let it take days,
sometimes even 10 days.  we haven't had to run a crash program to rework
the content set, so i can't share any real-world experience there.

         

        ian

         

                 

                
________________________________


                From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Hartwig,
Brent (CL Tech Sv)
                Sent: Tuesday, September 23, 2008 8:03 AM
                To: [email protected]
                Subject: [MarkLogic Dev General] CORB: Sleep during
configurable hours and process 1 forest at a time

                Hello,

                 

                Has anyone extended Corb to sleep during configurable
periods or process one forest at a time?

                 

                We need to modify every object in our ML instance.
Multiple merges are saturating the IO channel. To keep production stable
and usable, we intend to put the job to sleep during peak hours and only
process one forest at a time. Each processed URI will go into a
collection, allowing us to verify all are processed. Preliminary
approaches are described below. Your thoughts and experience are
welcome. Thank you in advance.

                 

                Sleep: Nothing too concerning here (but tried & true is
always better). We're planning to work around backups, peak hours and
allow time for system resources to recover before peak hours resume.
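                One way to implement the "sleep during peak hours" rule
is a simple window check against the clock. The window boundaries below
are placeholders, not our actual schedule; substitute your own peak
window (including one that wraps past midnight, as Ian's 6pm-2am
low-usage window does):

```python
import datetime

def in_peak_hours(now, start_hour=6, end_hour=18):
    """Return True if `now` falls inside the peak window.

    start_hour and end_hour are placeholder values.  A window whose end
    is earlier than its start (e.g. start=22, end=2) is treated as
    wrapping past midnight.
    """
    h = now.hour
    if start_hour <= end_hour:
        return start_hour <= h < end_hour
    return h >= start_hour or h < end_hour
```

                The job would check this before each burst and sleep
until the window closes, leaving slack for system resources to recover.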

                 

                Forest: Corb can obtain a list of forests from the
specified database via Session.getContentbaseMetaData().getForestIds()
and iterate in serial. The queue would be populated once per forest by
substituting the forest ID within the provided URIS-MODULE. The initial
implementation may impose some usage constraints.
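                The one-forest-at-a-time loop could look roughly like
this sketch. All three callables are hypothetical stand-ins:
get_forest_ids for the XCC Session.getContentbaseMetaData().getForestIds()
call, get_uris_for_forest for the URIS-MODULE invoked with the forest ID
substituted in, and process_uri for the per-document update:

```python
def process_one_forest_at_a_time(get_forest_ids, get_uris_for_forest,
                                 process_uri):
    """Populate the work queue once per forest and drain it serially.

    All three arguments are hypothetical stand-ins for the real XCC /
    CORB calls described above; this only shows the control flow of
    serial, per-forest processing.
    """
    for forest_id in get_forest_ids():
        # Queue is populated once per forest, then drained completely
        # before moving on to the next forest.
        for uri in get_uris_for_forest(forest_id):
            process_uri(uri)
```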

                 

                -Brent

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general