Ron, I believe that if you *can* fit your application onto one host without ridiculous hardware costs, you should. There are many actual and potential optimizations that apply to local forests but break down once the forests are spread across different hosts.
On Linux vs Windows: I think you'll have more control on Linux. You can tune the OS I/O caching and swappiness, as well as dirty-page management. It might also help to know that default OS I/O caching behaves quite differently between the two. I've done tests where I set the compressed tree cache as low as 1-MB on Linux without measurable performance impact: the OS buffer cache already does pretty much the same job, so it picks up the slack. As I understand it, that would not be true on Windows.

Getting everything into RAM is an interesting problem. You can turn on "preload mapped data" for the database, although that makes forest mounts quite I/O-intensive. If possible, use huge pages so the OS can more easily manage all the RAM you're planning to use.

Next I think you'll want to figure out how your forests break out into mappable files, ListData, and TreeData (a sketch for measuring that is appended below). You'll need available RAM for the mappable files, a list cache equal to the ListData, a compressed tree cache equal to the TreeData, and an expanded tree cache of about 3x the TreeData. Often ListData is something like 75% of forest size and TreeData something like 20%, with mappable files making up the rest - but your database will likely be a little different, so measure it. With those proportions, a 10000-MiB database would call for something like:

  list cache              7500 MiB  (1x ListData)
  compressed tree cache   2000 MiB  (1x TreeData)
  expanded tree cache     6000 MiB  (3x TreeData)
  mapped files             500 MiB
  ----------------------------------
  total                  16000 MiB  (about 1.6x database size)

Ideally you'd have *another* 1x so that the OS buffer cache could also hold all the files. You'll also want to account for the OS, merges, and in-memory stands for those infrequent updates. I expect 32-GB would be more than enough for this example. As noted above, you could skimp on the compressed tree cache if the OS buffer cache will do something similar, but that's a marginal savings.

At some point in all this, you'll want to copy all the data into RAM. Slow I/O will impede that, especially if everything is done using random reads. One way to manage this is a ramdisk, but that has the disadvantage of requiring another large slice of memory: probably more than 2x, to allow merge space, since you will have some updates. You could easily load up the OS buffer cache with something like 'find /mldata | xargs wc', as long as the buffer cache doesn't outsmart you. On Linux, where fadvise is available, using 'willneed' (POSIX_FADV_WILLNEED) should help. This approach doesn't load the tree caches, but database reads should come from the buffer cache instead of driving more I/O. If you're testing the buffer cache on Linux, it helps to know how to clear it: http://www.kernel.org/doc/Documentation/sysctl/vm.txt describes the 'drop_caches' sysctl for this.

Once the OS buffer cache is populated you could just let the list and tree caches populate themselves from there, and that might be best. But by definition you have enough tree cache to hold the entire database, so you could load it up with something like 'xdmp:eval("collection()")[last()]' (see the sketch below). The xdmp:eval ensures a non-streaming context, which is necessary because streaming bypasses the tree caches.

I can't really think of a good way to pre-populate the list cache. You could run cts:search with random terms, but that would be slow and there would be no guarantee of completeness. Another approach would be to build a word-query of every word in the word lexicon, and then estimate it (also sketched below). That won't be complete either, since it won't exercise non-word terms. Probably it's best to ensure that the ListData is in the OS buffer cache, and leave it at that.
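Here are those sketches. First, measuring the ListData/TreeData breakdown. This is a rough sketch only: it assumes local forests and the filesystem-access privilege, and the exact element names in the forest-status and directory output may differ between releases, so check against your version.

  xquery version "1.0-ml";
  (: Rough sketch: sum on-disk ListData and TreeData sizes per forest.
     Assumes local forests and the filesystem-access privilege. :)
  for $f in xdmp:forests()
  let $paths :=
    for $s in xdmp:forest-status($f)//*:stand/*:path
    return fn:string($s)
  let $entries :=
    for $p in $paths
    return xdmp:filesystem-directory($p)//*:entry
  return
    <forest name="{xdmp:forest-name($f)}">
      <list-data-bytes>{
        fn:sum($entries[*:filename eq "ListData"]/*:content-length)
      }</list-data-bytes>
      <tree-data-bytes>{
        fn:sum($entries[*:filename eq "TreeData"]/*:content-length)
      }</tree-data-bytes>
    </forest>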
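If you'd rather script the cache sizing than click through the admin UI, something like this should work via the Admin API. The numbers are just the worked example above, and "Default" is an assumed group name - adjust both for your cluster. As I recall, cache-size changes need a restart to take effect.

  xquery version "1.0-ml";
  import module namespace admin = "http://marklogic.com/xdmp/admin"
      at "/MarkLogic/admin.xqy";
  (: Sketch: apply the example sizing above to the "Default" group.
     Sizes are in MB; saving cache-size changes triggers a restart. :)
  let $cfg := admin:get-configuration()
  let $gid := admin:group-get-id($cfg, "Default")
  let $cfg := admin:group-set-list-cache-size($cfg, $gid, 7500)
  let $cfg := admin:group-set-compressed-tree-cache-size($cfg, $gid, 2000)
  let $cfg := admin:group-set-expanded-tree-cache-size($cfg, $gid, 6000)
  return admin:save-configuration($cfg)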
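To force the tree caches, here is the one-liner from above, padded out a bit. The time-limit call is only so that a large database doesn't hit the default request limit (it's still capped by the group's max time limit); the [last()] keeps the returned result small while the non-streaming eval materializes every fragment.

  xquery version "1.0-ml";
  (: Sketch: warm the tree caches by materializing every fragment.
     xdmp:eval gives a non-streaming context; streaming would
     bypass the tree caches. :)
  xdmp:set-request-time-limit(3600),
  xdmp:eval("collection()")[last()]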
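And the word-lexicon trick for the list cache. This assumes the database has its word lexicon enabled; cts:words() with no arguments walks the whole lexicon, which can itself be slow on a big database, and as noted it won't touch non-word term lists.

  xquery version "1.0-ml";
  (: Sketch: estimate one word-query over every lexicon word.
     xdmp:estimate resolves the query from the indexes alone,
     so it reads term lists without fetching any documents. :)
  xdmp:estimate(
    cts:search(fn:collection(), cts:word-query(cts:words()))
  )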
-- Mike

On 16 Feb 2013, at 10:50, Ron Hitchens <[email protected]> wrote:

> I'm trying to work out the best way to deploy a system
> I'm designing into the cloud on AWS. We've been through
> various permutations of AWS configurations and the main
> thing we've learned is that there is a lot of uncertainty
> and unpredictability around I/O performance in AWS.
>
> It's relatively expensive to provision guaranteed, high
> performance I/O. We're testing an SSD solution at the
> moment, but that is ephemeral (lost if the VM shuts down)
> and very expensive. That's not a deal-killer for our
> architecture, but makes it more complicated to deploy
> and strains the ops budget.
>
> RAM, on the other hand, is relatively cheap to add to
> an AWS instance. The total database size, at present, is
> under 20GB and will grow relatively slowly. Provisioning
> an AWS instance with ~64GB of RAM is fairly cost effective,
> but the persistent EBS storage is sloooow.
>
> So, I have two questions:
>
> 1) Is there a best practice to tune MarkLogic where
> RAM is plentiful (twice the size of the data or more) so
> as to maximize caching of data? Ideally, we'd like the
> whole database loaded into RAM. This system will run as
> a read-only replica of a master database located elsewhere.
> The goal is to maximize query performance, but updates of
> relatively low frequency will be coming in from the master.
>
> The client is a Windows shop, but Linux is an approved
> solution if need be. Are there exploitable differences at
> the OS level that can improve filesystem caching? Are there
> RAM disk or configuration tricks that would maximize RAM
> usage without affecting update persistence?
>
> 2) Given #1 could lead to a mostly RAM-based configuration,
> does it make sense to go with a single high-RAM, high-CPU
> E+D-node that serves all requests with little or no actual I/O?
> Or would it be an overall win to cluster E-nodes in front of
> the big-RAM D-node to offload query evaluation and pay the
> (10-gb) network latency penalty for inter-node comms?
>
> We do have the option of deploying multiple standalone
> big-RAM E+D-nodes, each of which is a full replica of the data
> from the master. This would basically give us the equivalent
> of failover redundancy, but at the load balancer level rather
> than within the cluster. This would also let us disperse
> them across AZs and regions without worrying about split-brain
> cluster issues.
>
> Thoughts? Recommendations?
>
> ---
> Ron Hitchens {mailto:[email protected]}  Ronsoft Technologies
> +44 7879 358 212 (voice)  http://www.ronsoft.com
> +1 707 924 3878 (fax)  Bit Twiddling At Its Finest
> "No amount of belief establishes any fact." -Unknown
