Hi, Kelly,

We have both XML and binary. In this particular case, we're changing the 
permissions on all files and folders via xdmp:document-add-permissions(). We're 
also adding each processed URI to a collection via 
xdmp:document-add-collections(). Thank you.

-Brent

________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Kelly Stirman
Sent: Wednesday, September 24, 2008 2:08 PM
To: [email protected]
Subject: [MarkLogic Dev General] RE: CORB: Sleep during configurable hours
and process 1 forest at a time


Sorry to arrive late to the thread - are your documents xml or binary?

I am wondering if you can take advantage of properties to manage your updates. 
This would make your updates substantially smaller than updating the entire 
file, and therefore much more efficient.

You may have already considered and dismissed this option. Apologies if this is 
the case.

Kelly

----- Original Message -----
From: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
To: [email protected] <[email protected]>
Sent: Wed Sep 24 10:46:28 2008
Subject: General Digest, Vol 51, Issue 27

Today's Topics:

   1. RE: CORB: Sleep during configurable hours and process 1 forest
      at a time (Ian Small)


----------------------------------------------------------------------

Message: 1
Date: Wed, 24 Sep 2008 10:59:08 -0700
From: "Ian Small" <[EMAIL PROTECTED]>
Subject: RE: [MarkLogic Dev General] CORB: Sleep during configurable
        hours and process 1 forest at a time
To: "General Mark Logic Developer Discussion"
        <[email protected]>
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset="us-ascii"

Oops, that should be "fewer", not "newer":
And you'll need *fewer* transactions to fill up your in-memory stand
before it gets flushed.  But the flush of that stand to disk will be as
efficient as it was previously.

ian


________________________________

        From: Ian Small
        Sent: Wednesday, September 24, 2008 10:56 AM
        To: 'General Mark Logic Developer Discussion'
        Subject: RE: [MarkLogic Dev General] CORB: Sleep during
configurable hours and process 1 forest at a time


        Ah, excellent question.

        When you're updating a document in the database, the following
things need to happen:

        1) read the document off disk
        2) update the appropriate nodes in the document, which:
            a) writes journal entries (for recovery purposes, if needed)
               to disk
            b) appends fragments (or if you have no fragmentation
               policy, a document) to an in-memory stand
            c) marks the old fragments (or document) as obsolete
        3) only when an in-memory stand is full (or write activity
           appears to quiesce) will the server flush the in-memory stand
           to disk in a highly-efficient sequential write operation

        So:
        Obviously, the larger the document, the more I/O the read will
consume.
        And if you're working with large unfragmented documents, the
journal entries may be bigger.  But the good news is that writing to the
journal is pretty efficient (as you would expect).
        And you'll need newer transactions to fill up your in-memory
stand before it gets flushed.  But the flush of that stand to disk will
be as efficient as it was previously.
        Marking old fragments or documents obsolete is noise for the
purposes of this discussion.
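
To make the point about in-memory stands concrete, here is a rough Python sketch of the arithmetic. The 64 MB stand capacity is a hypothetical figure picked only for illustration (it does not come from this thread or from any server setting), and the function name is made up. The 100 KB and 20 MB document sizes echo the sizes mentioned in the thread.

```python
def updates_to_fill_stand(stand_capacity_kb: int, doc_size_kb: int) -> int:
    """Approximate number of document updates that fill one in-memory stand.

    Assumes each update appends roughly one whole document to the stand,
    i.e. the unfragmented case described above.
    """
    if stand_capacity_kb <= 0 or doc_size_kb <= 0:
        raise ValueError("sizes must be positive")
    # Even a document larger than the stand counts as one update.
    return max(1, stand_capacity_kb // doc_size_kb)


HYPOTHETICAL_STAND_KB = 64 * 1024  # assumption for illustration only

small_docs = updates_to_fill_stand(HYPOTHETICAL_STAND_KB, 100)        # ~100 KB docs
large_docs = updates_to_fill_stand(HYPOTHETICAL_STAND_KB, 20 * 1024)  # 20 MB docs
```

Under these assumed numbers, a burst of 100 updates of 20 MB documents would fill the stand many times over, whereas the same burst of 100 KB documents would not fill it once. That is the reasoning behind doing fewer transactions per burst with large documents.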

        So if I were in your position, I might do fewer transactions in
each burst before "resting".

        I think the last time we rolled our content set a few weeks
back, we did 100 then slept for a second, but our docs tend to be under
100 KB.  Our servers handled this without blinking.  It wasn't that 100
was a number tuned up to the maximum potential.  What we did was
model this rate and show that it converged on completion fast enough for
our purposes, so we went with it because our run-time performance
degraded minimally at this level (our goal is to serve message pages in
less than 50ms, way less when possible).  In your case, you might want
to experiment with some lower numbers, depending mostly on your I/O
system's ability to sustain throughput.  (It wouldn't surprise me if
your I/O system was more capable than the one we're using...)
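
A minimal Python sketch of the "burst, then rest" pattern described above. This is not the actual reprocessing code from either site. The function and parameter names are hypothetical, the per-URI work is passed in as a callable, and the defaults (100 per burst, 1 second rest) are simply the figures quoted in the message. The sleep function is injectable so the example runs without really waiting.

```python
import time
from typing import Callable, Iterable, List


def process_in_bursts(
    uris: Iterable[str],
    process: Callable[[str], None],
    burst_size: int = 100,
    rest_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process URIs one by one, resting after every full burst.

    Returns the number of URIs processed. Note that a rest also follows
    the final burst when the total is an exact multiple of burst_size.
    """
    if burst_size < 1:
        raise ValueError("burst_size must be at least 1")
    done = 0
    for uri in uris:
        process(uri)
        done += 1
        if done % burst_size == 0:
            sleep(rest_seconds)
    return done


# Usage with stand-ins: record the rests instead of actually sleeping.
rests: List[float] = []
seen: List[str] = []
total = process_in_bursts(
    (f"/doc/{i}.xml" for i in range(250)),
    seen.append,
    sleep=rests.append,
)
```

Lowering burst_size is the knob to experiment with for large documents. The rest stays short so that write activity does not appear to quiesce.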

        ian


________________________________

                From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Hartwig,
Brent (CL Tech Sv)
                Sent: Wednesday, September 24, 2008 4:37 AM
                To: General Mark Logic Developer Discussion
                Subject: RE: [MarkLogic Dev General] CORB: Sleep during
configurable hours and process 1 forest at a time



                Hi, Ian,



                Quite a lot to chew on - thank you! I understand your
process is able to run continuously yet, to keep the site running
smoothly, the process takes short breaks and imposes a size limit on the
merges. That size limit requires you to initiate the residual merge
during a low usage period.



                Do you believe the size of the documents being updated
would impact this approach? Our files can be quite large. It is common
for folders to include multiple 10 - 20 MB files. We do have files
approaching the 300 MB limit.



                -Brent




________________________________


                From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Ian Small
                Sent: Tuesday, September 23, 2008 2:52 PM
                To: General Mark Logic Developer Discussion
                Subject: RE: [MarkLogic Dev General] CORB: Sleep during
configurable hours and process 1 forest at a time



                hi -



                while we don't use corb to do it, we do in fact do
large-scale in-place modifications of the markmail.org production
content set.  we take a similar approach to yours:

                - only work on one forest (in our case, per D node) at
once

                - we manage the concurrency of the work to make sure
there are lots of cores available for user queries

                - we pause in between small bursts of reprocessing

                - we manage monster merges manually so that they happen
during our low usage time (we have global users, so this is between
about 6pm and 2 am pacific)



                we do all this because, like you, we are working around
live load on the server and need to maintain response time while all
this is going on



                some things to keep note of:

                - pausing between every operation can backfire - because
if the pause is long enough, it can "trick" the server into thinking
that there are no more updates coming, which can cause an in-memory
stand to be flushed out to disk.  the result of this is that a bunch of
really small in-memory stands can get shot out to disk, requiring more
merges - although those merges will be incredibly fast and incredibly
lightweight.  so we tend to keep our pauses short, just long enough to
give other processing some time to get through.  so you may want to
experiment on this front a little bit.

                - we NEVER turn off merges - this is essentially playing
russian roulette, and committing to pull the trigger 12 times while
waiting to see what happens.  what we do is limit the large merges
(where large is compared to our forest size).  in our case, with 200 GB
forests, we might stick our limit at 75-100 GB, for instance.  that
generally leaves us with a forest with 2-4 stands in it, which we can
then merge manually in low times.

                - we start the manual "all done" merges using the forest
admin pages



                in general, we take this approach because we plan our
reprocessing sufficiently far in advance that we can let it take days,
sometimes even 10 days.  we haven't had to run a crash program to
rework the content set, so i can't share any real-world experience
there.



                ian






________________________________


                        From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Hartwig,
Brent (CL Tech Sv)
                        Sent: Tuesday, September 23, 2008 8:03 AM
                        To: [email protected]
                        Subject: [MarkLogic Dev General] CORB: Sleep
during configurable hours and process 1 forest at a time

                        Hello,



                        Has anyone extended Corb to sleep during
configurable periods or process one forest at a time?



                        We need to modify every object in our ML
instance. Multiple merges are saturating the IO channel. To keep
production stable and usable, we intend to put the job to sleep during
peak hours and only process one forest at a time. Each processed URI
will go into a collection, allowing us to verify all are processed.
Preliminary approaches are described below. Your thoughts and experience
are welcome. Thank you in advance.



                        Sleep: Nothing too concerning here (but tried &
true is always better). We're planning to work around backups and peak
hours, and to allow time for system resources to recover before peak
hours resume.
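
One way the "sleep during configurable periods" check could look, sketched in Python rather than as a Corb patch. The window times below are invented placeholders (the thread does not give the actual backup or peak hours), and the function names are hypothetical. The only subtle part is a window that wraps past midnight.

```python
from datetime import time
from typing import Iterable, Tuple

Window = Tuple[time, time]


def in_window(now: time, start: time, end: time) -> bool:
    """True if now falls in [start, end); the window may wrap past midnight.

    A window whose start equals its end is treated as empty.
    """
    if start <= end:
        return start <= now < end
    # Wrapping window, e.g. 23:00 to 01:00.
    return now >= start or now < end


def should_sleep(now: time, blocked: Iterable[Window]) -> bool:
    """True if the job should pause because now is inside any blocked window."""
    return any(in_window(now, start, end) for start, end in blocked)


# Hypothetical schedule, for illustration only.
BLOCKED_WINDOWS = [
    (time(8, 0), time(18, 0)),  # assumed peak hours
    (time(23, 0), time(1, 0)),  # assumed backup window, wraps past midnight
]
```

The recovery time before peak hours resume can be expressed by starting the peak window earlier than the real peak, so no separate rule is needed.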



                        Forest: Corb can obtain a list of forests from
the specified database via
Session.getContentbaseMetaData().getForestIds() and iterate in serial.
The queue would be populated once per forest by substituting the forest
ID within the provided URIS-MODULE. The initial implementation may
impose some usage constraints.
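
The forest-at-a-time idea can be sketched as follows. This is a Python simulation of the control flow only, not Corb or XCC code: the placeholder token, the function names and the fake URI lookup are all hypothetical, and in the real job the forest IDs would come from the XCC call named above.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical token that the URIs module text would contain.
FOREST_PLACEHOLDER = "@FOREST-ID@"


def uris_query_for_forest(template: str, forest_id: int) -> str:
    """Substitute one forest ID into the URIs module text."""
    if FOREST_PLACEHOLDER not in template:
        raise ValueError("template has no forest placeholder")
    return template.replace(FOREST_PLACEHOLDER, str(forest_id))


def run_forest_by_forest(
    forest_ids: Iterable[int],
    template: str,
    fetch_uris: Callable[[str], Iterable[str]],
    process: Callable[[str], None],
) -> Dict[int, int]:
    """Process forests strictly in serial; return URIs processed per forest."""
    counts: Dict[int, int] = {}
    for forest_id in forest_ids:
        query = uris_query_for_forest(template, forest_id)
        queue = list(fetch_uris(query))  # queue is populated once per forest
        for uri in queue:
            process(uri)
        counts[forest_id] = len(queue)
    return counts


# Usage with stand-ins for the database and the per-URI work.
fake_db = {
    "uris in forest 101": ["/a.xml", "/b.bin"],
    "uris in forest 102": ["/c.xml"],
}
processed: List[str] = []
counts = run_forest_by_forest(
    [101, 102],
    "uris in forest @FOREST-ID@",
    lambda query: fake_db[query],
    processed.append,
)
```

The per-forest counts give a cheap cross-check against the size of the "processed" collection once the run completes.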



                        -Brent


------------------------------

_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general