hi -
 
while we don't use corb to do it, we do in fact do large-scale in-place
modifications of the markmail.org production content set.  we take a
similar approach to yours:
- only work on one forest (in our case, per D node) at once
- we manage the concurrency of the work to make sure there are lots of
cores available for user queries
- we pause in between small bursts of reprocessing
- we manage monster merges manually so that they happen during our low
usage time (we have global users, so this is between about 6pm and 2 am
pacific)
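
a minimal sketch of the low-usage check, in python rather than corb's java, just to show the one subtle part: the window wraps past midnight.  the function name and the exact 18:00 and 02:00 boundaries are assumptions for illustration, and the caller is assumed to pass in a time of day already converted to pacific time.

```python
from datetime import time

# Assumed boundaries, taken from "between about 6pm and 2 am pacific" above.
LOW_USAGE_START = time(18, 0)
LOW_USAGE_END = time(2, 0)


def in_low_usage_window(local_time, start=LOW_USAGE_START, end=LOW_USAGE_END):
    """Return True if local_time falls in [start, end).

    local_time is a datetime.time already expressed in the window's
    time zone.  The window may wrap past midnight (start > end).
    """
    if start <= end:
        # Ordinary same-day window, e.g. 01:00 to 05:00.
        return start <= local_time < end
    # Wrapping window, e.g. 18:00 to 02:00: late evening OR early morning.
    return local_time >= start or local_time < end
```

a job could call this before starting a big manual merge, and simply wait and check again later when it returns False.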
 
we do all this because, like you, we are working around live load on the
server and need to maintain response time while all this is going on
 
some things to take note of:
- pausing between every operation can backfire - if the pause is
long enough, it can "trick" the server into thinking that there are no
more updates coming, which can cause an in-memory stand to be flushed
out to disk.  the result of this is that a bunch of really small
in-memory stands can get shot out to disk, requiring more merges -
although those merges will be incredibly fast and incredibly
lightweight.  so we tend to keep our pauses short: long enough to give
other processing some time to get through, but not so long that the
server starts flushing stands.  you may want to experiment on this
front a little bit.
- we NEVER turn off merges - this is essentially playing russian
roulette, and committing to pull the trigger 12 times while waiting to
see what happens.  what we do is limit the large merges (where large is
compared to our forest size).  in our case, with 200 GB forests, we
might stick our limit at 75-100 GB, for instance.  that generally leaves
us with a forest with 2-4 stands in it, which we can then merge manually
in low times.
- we start the manual "all done" merges using the forest admin pages
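
the pacing described in these notes (one forest at a time, small bursts, a short pause after each burst rather than after every operation) can be sketched roughly as below.  this is python, not corb, and every name in it is hypothetical: uris_by_forest, process_uri, burst_size and pause_seconds are placeholders, and the default numbers are not values we actually use.  managing concurrency is left out to keep the sketch short.

```python
import time
from itertools import islice


def reprocess_in_bursts(uris_by_forest, process_uri, burst_size=50,
                        pause_seconds=0.5, sleep=time.sleep):
    """Reprocess forests serially, in small bursts separated by short pauses.

    uris_by_forest: dict mapping a forest name to an iterable of URIs.
    process_uri:    callable(forest, uri) that does the actual update.
    sleep:          injectable so the pacing can be tested without waiting.
    Returns the number of URIs processed.
    """
    processed = 0
    for forest, uris in uris_by_forest.items():  # one forest at a time
        remaining = iter(uris)
        while True:
            burst = list(islice(remaining, burst_size))
            if not burst:
                break  # this forest is done, move on to the next one
            for uri in burst:
                process_uri(forest, uri)
                processed += 1
            # Pause once per burst, not once per operation, and keep it short.
            sleep(pause_seconds)
    return processed
```

the two knobs to experiment with are burst_size and pause_seconds: the pause only has to be long enough to let user queries through.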
 
in general, we take this approach because we plan our reprocessing
sufficiently far in advance that we can let it take days, sometimes even 10
days.  we haven't had to run a crash program to rework the
content set, so i can't share any real-world experience there.
 
ian
 


________________________________

        From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Hartwig,
Brent (CL Tech Sv)
        Sent: Tuesday, September 23, 2008 8:03 AM
        To: [email protected]
        Subject: [MarkLogic Dev General] CORB: Sleep during configurable
hours and process 1 forest at a time
        
        

        Hello,

         

        Has anyone extended Corb to sleep during configurable periods or
process one forest at a time?

         

        We need to modify every object in our ML instance. Multiple
merges are saturating the IO channel. To keep production stable and
usable, we intend to put the job to sleep during peak hours and only
process one forest at a time. Each processed URI will go into a
collection, allowing us to verify all are processed. Preliminary
approaches are described below. Your thoughts and experience are
welcome. Thank you in advance.

         

        Sleep: Nothing too concerning here (but tried & true is always
better). We're planning to work around backups, peak hours and allow
time for system resources to recover before peak hours resume.

         

        Forest: Corb can obtain a list of forests from the specified
database via Session.getContentbaseMetaData().getForestIds() and iterate
in serial. The queue would be populated once per forest by substituting
the forest ID within the provided URIS-MODULE. The initial
implementation may impose some usage constraints.

         

        -Brent

