Another thing to consider, if you do decide to put this on one disk, is to 
also invest in an SSD for that host and specify the path to the SSD as the 
fast data directory for the forests on that host.  That will speed up updates 
to those forests, even if the data directory filesystem is not especially 
fast.  That might be a good way to get the best of both worlds you are 
looking for here.

You need 5.0 to use the fast data directory, and I will repeat Mike's 
recommendation to go to 5.0 instead of 4.2.  You should be able to go directly 
to 5.0 from 4.1.  It will not be much (if any) harder for you to do, and there 
will be many benefits.  Just do your application and upgrade testing on 5.0 
instead of 4.2.

-Danny

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Michael Blakeley
Sent: Monday, July 16, 2012 7:10 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Migrating content to another database

Not normally, no.

The CPF configuration is stored in the Triggers database, and describes how CPF 
should run. By design, this changes rarely. The CPF state of each document is 
stored in document properties, which are stored in the same database as the 
documents themselves. These change several times after each update, as the 
document follows the CPF pipeline.

-- Mike

On 15 Jul 2012, at 20:02 , Tim Meagher wrote:

> Hi Mike,
> 
> When you say that CPF does quite a few journal writes, does that take place
> in the associated triggers database, i.e., should the triggers database be
> carefully architected as well?  
> 
> Tim
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Michael
> Blakeley
> Sent: Sunday, July 15, 2012 9:38 PM
> To: MarkLogic Developer Discussion
> Subject: Re: [MarkLogic Dev General] Migrating content to another database
> 
> All else equal, RAID-5 is likely to be the biggest problem. RAID-5/6 are
> pretty bad for write performance, which will impact journal writes and large
> merges. CPF does quite a few journal writes, and merges just happen.
> 
> Here's a test you might try: back up your old forests to the new device, and
> then restore them to a set of newly-created forests on the same device.
> Force a merge on all three, and see how long it takes to finish. You can
> compare with the old devices, which will give you a good idea of just how
> much your new device configuration would impact performance. All things may
> not be equal, so it may be an improvement despite my knee-jerk reaction.
> 
> -- Mike
> 
> On 15 Jul 2012, at 11:51 , Tim Meagher wrote:
> 
>> Hi Michael,
>> 
>> Thanks for following up on this.  My database currently has 
>> approximately 8M documents taking up over 460 GB of space, split across 
>> 3 forests on 3 separate 500 GB devices.  I retain documents at various 
>> phases of production, processed by more than one CPF domain (input, 
>> intermediate, which takes a while to produce, and final formats).  The 
>> plan is to go to a single 1.5 TB local device in a RAID 5 configuration.  
>> I was considering just copying the forests over to the new device to 
>> simplify the content migration, but my numbers don't work with your 
>> recommendations for performance.  Sounds like I should only have one or 
>> two destination forests, but without re-architecting my dataflow I way 
>> exceed the max number of GBs per database.  At what point does 
>> performance degradation become noticeable?
>> 
>> Thank you!
>> 
>> Tim Meagher
>> 
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Michael 
>> Blakeley
>> Sent: Sunday, July 15, 2012 2:15 PM
>> To: MarkLogic Developer Discussion
>> Subject: Re: [MarkLogic Dev General] Migrating content to another 
>> database
>> 
>> I know I am late with this reply, but I wanted to plug 
>> https://github.com/mblakele/task-rebalancer for this job. The downside 
>> is that it requires 5.0, or some patches for 4.2. But the 5.0 upgrade 
>> is very easy: if you are already planning on a move to 4.2, consider 
>> going to 5.0 instead. You can defer reindexing until you are ready.
>> 
>> Assuming you can use it, the task-rebalancer should be faster than 
>> XQSync - as long as you allocate enough task server threads. The 
>> project also provides an example module to 'evacuate' a forest. You 
>> could modify it to evacuate your old forests, which would populate your 
>> new one(s).
>> 
>> Along those lines, I wouldn't let a single forest grow without bounds, 
>> and multiple devices are good for performance. Generally speaking I 
>> try to make sure the database has:
>> 
>> * 1 forest per 2 CPU cores
>> * no more than 200-GB each
>> * no more than 32M documents each
>> * no more than 2 forests per filesystem
>> * no more than 1 forest per spindle
>> 
>> Some of these rules depend on the situation, too. With positions 
>> enabled, for example, I would try for something closer to 8M documents 
>> or 100-GB, whichever comes first. With RAID-1 or RAID-10 I would count 
>> drive-pairs as spindles. With RAID-5 or RAID-6 I would count RAID groups 
>> as spindles.
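
A rough way to apply the sizing rules above to a concrete case, sketched in
Python. The function and class names, the reading of the CPU, filesystem and
spindle rules as upper bounds, and the 8-core host are assumptions for
illustration only; the numeric limits and the 460 GB / 8M document figures
are the ones given in this thread.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Per-forest ceilings taken from the rules of thumb in this thread."""
    max_gb_per_forest: float = 200.0
    max_docs_per_forest: int = 32_000_000


# Tighter ceilings suggested in the thread when positions are enabled.
POSITIONS_LIMITS = Limits(max_gb_per_forest=100.0, max_docs_per_forest=8_000_000)


def forests_needed(total_gb, total_docs, limits=Limits()):
    """Minimum forest count that keeps every forest under both ceilings."""
    by_size = math.ceil(total_gb / limits.max_gb_per_forest)
    by_docs = math.ceil(total_docs / limits.max_docs_per_forest)
    return max(1, by_size, by_docs)


def forests_supported(cpu_cores, filesystems, spindles):
    """Forest count the hardware supports, treating each rule as a cap.

    Assumption: 1 forest per 2 cores, at most 2 forests per filesystem,
    at most 1 forest per spindle. A RAID-5/6 group counts as one spindle.
    """
    return min(cpu_cores // 2, 2 * filesystems, spindles)


if __name__ == "__main__":
    # 460 GB and 8M documents, moving to one RAID-5 device on one filesystem.
    # The 8-core host is a hypothetical value, not stated in the thread.
    print("needed:", forests_needed(460, 8_000_000))
    print("needed with positions:", forests_needed(460, 8_000_000, POSITIONS_LIMITS))
    print("supported:", forests_supported(cpu_cores=8, filesystems=1, spindles=1))
```

For the figures in this thread the sketch reports 3 forests needed by size
(5 with positions enabled) against 1 forest supported by a single RAID-5
group, which is the mismatch Tim describes in his follow-up.
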
>> 
>> -- Mike
>> 
>> On 10 Jul 2012, at 19:15 , Tim Meagher wrote:
>> 
>>> Hi Folks,
>>> 
>>> I have over 150 GB of content in one database that is currently 
>>> spread unevenly across 3 forests on 3 separate devices.  I need to 
>>> migrate this content to a new database which uses one device with 
>>> more than enough space for all of the content.  Since there is only 
>>> one device, I'm wondering if there is any advantage or disadvantage 
>>> to using multiple forests.  I think I should be able to simply copy 
>>> the content by creating 3 forests in the new database and copying the 
>>> forests over, but I'd like to know if this is not an optimal solution 
>>> in which case I will need to be a little more resourceful about 
>>> copying the content over.  Perhaps xqsync?
>>> 
>>> Tim Meagher
>>> 
>>> P.S. Moving content from ML 4.1 to ML 4.2.  Sorry, not yet at 5!
>>> 
>>> _______________________________________________
>>> General mailing list
>>> [email protected]
>>> http://developer.marklogic.com/mailman/listinfo/general