It's pretty difficult to estimate the E-node vs. D-node split without deep
knowledge of the queries. But you could probably set up a testbed and measure
it directly.
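
For example, if you expose one representative query through an HTTP app
server, even something as blunt as ApacheBench gives a first-order comparison
between topologies. This is only a sketch: the port, path, concurrency and
credentials below are placeholders, and /test.xqy is a hypothetical module
that runs your representative query.

    # Compare latency/throughput of a single E+D host vs. separate E-nodes
    # by pointing the same test at each candidate configuration.
    # MarkLogic app servers default to digest auth; switch the app server to
    # basic auth (or handle auth in the endpoint) for ab's -A to work.
    ab -n 2000 -c 32 -A testuser:testpass http://my-ml-host:8011/test.xqy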

In the absence of data, I would plan on using an elastic load balancer (ELB) 
from day one. That way you can experiment with E-hosts without breaking 
anything, and also add replicas without breaking anything. Plus it's convenient 
to have that reliable, external DNS entry.
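
For instance, with the AWS CLI installed, standing one up and registering the
MarkLogic hosts behind it looks roughly like this -- the names, zone, ports
and instance IDs are placeholders, not recommendations:

    # Forward port 80 on the ELB to a MarkLogic app server on 8011
    aws elb create-load-balancer \
        --load-balancer-name ml-query-lb \
        --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=8011" \
        --availability-zones us-east-1a

    # Add (or later remove) E-hosts / replicas behind the same stable DNS name
    aws elb register-instances-with-load-balancer \
        --load-balancer-name ml-query-lb \
        --instances i-12345678 i-23456789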

-- Mike

On 16 Feb 2013, at 13:47, Ron Hitchens <[email protected]> wrote:

> 
>   While I'm at it, another question.  Assuming it works well
> to set up a single E+D big-RAM system that runs super fast
> because it does virtually no I/O, where is the upper limit
> to scalability as concurrent requests go up?
> 
>   Such a system in AWS would probably have 8-16 cores.
> Even with nearly frictionless data access, there will be
> an upper bound to how many queries can be evaluated in a
> given unit of time.  How do we determine the cross-over
> point where it's better to add E-nodes to spread the CPU
> load and let the big-RAM guy focus on data access?
> 
>   Would it make sense to go with n*E + 1*D from the start
> where n can be dialed up and down easily?  Or go with one
> monolithic E+D and just replicate it as the load goes up?
> The usage profile is likely to have peaks and valleys
> at fairly predictable times during the day/week.
> 
> On Feb 16, 2013, at 8:37 PM, Michael Blakeley <[email protected]> wrote:
> 
>> Ron, I believe that if you *can* fit your application onto one host without 
>> ridiculous hardware costs, you should. There are many actual and potential 
>> optimizations that can be done for local forests, but they break down once 
>> the forests are on different hosts.
>> 
>> On Linux vs Windows: I think you'll have more control on Linux. You can tune 
>> the OS I/O caching and swappiness, as well as dirty-page writeback. It 
>> might also help to know that default OS I/O caching behaves quite a bit 
>> differently between the two. I've done tests where I've set the 
>> compressed-tree cache size as low as 1 MB on Linux without measurable 
>> performance impact. The OS buffer cache already does pretty much the same 
>> job, so it picks up the slack. As I understand it, this would not be true on 
>> Windows.
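>> 
>> As a minimal sketch of the kind of Linux tuning I mean -- the exact values
>> are only starting points, so test them against your own workload:
>> 
>>     # Keep MarkLogic and the page cache in RAM rather than swapping
>>     sudo sysctl -w vm.swappiness=1
>>     # Start background writeback of dirty pages earlier, and cap how much
>>     # dirty data can accumulate before writers block (values are guesses)
>>     sudo sysctl -w vm.dirty_background_ratio=3
>>     sudo sysctl -w vm.dirty_ratio=15
>>     # Add the same keys to /etc/sysctl.conf to persist across reboots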
>> 
>> Getting everything into RAM is an interesting problem. You can turn on 
>> "preload mapped data" for the database, although that makes forest mounts 
>> quite I/O intensive. If possible, use huge pages so the OS can more easily 
>> manage all the RAM you're planning to use. Next I think you'll want to 
>> figure out how your forests break out into mappable files, ListData, and 
>> TreeData. You'll need available RAM for the mappable files, list-cache size 
>> equal to the ListData, compressed-tree cache size equal to TreeData, and 
>> expanded-tree cache equal to about 3x TreeData. Often ListData is something 
>> like 75% of forest size, TreeData is something like 20%, and mappable files 
>> make up the rest - but your database will likely be a little different, so 
>> measure it.
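>> 
>> One quick way to get those numbers, assuming the default Linux data
>> directory (adjust the path for your install):
>> 
>>     cd /var/opt/MarkLogic/Forests
>>     # Total ListData -> size the list cache from this
>>     du -sch */*/ListData
>>     # Total TreeData -> compressed-tree cache (and ~3x for expanded-tree)
>>     du -sch */*/TreeData
>>     # Mappable files and everything else (journals, Label, ...)
>>     du -sch --exclude=ListData --exclude=TreeData */*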
>> 
>> But with those proportions a 10000-MiB database would call for something 
>> like 7500-MiB list cache, 2000-MiB compressed tree, 6000-MiB expanded-tree, 
>> plus 500-MiB for mapped files: total about 1.6x database size. Ideally you'd 
>> have *another* 1x so that the OS buffer cache could also cache all the 
>> files. You'll also want to account for OS, merges, and in-memory stands for 
>> those infrequent updates. I expect 32-GB would be more than enough for this 
>> example. As noted above you could skimp on compressed-tree if the OS buffer 
>> cache will do something similar, but that's a marginal savings.
>> 
>> At some point in all this, you'll want to copy all the data into RAM. Slow 
>> I/O will impede that, especially if everything is done using random reads. 
>> One way to manage this is to use a ramdisk, but that has the disadvantage of 
>> requiring another large slice of memory: probably more than 2x to allow 
>> merge space, since you will have some updates. You could easily load up the 
>> OS buffer cache with something like 'find /mldata | xargs wc', as long as 
>> the buffer cache doesn't outsmart you. On Linux, where posix_fadvise is 
>> available, using POSIX_FADV_WILLNEED should help. This approach doesn't load 
>> the tree caches, but database reads should come from the buffer cache 
>> instead of driving more I/O.
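>> 
>> Something along these lines, reusing the /mldata path from above ('cat'
>> skips wc's line counting, and vmtouch -- if you have it installed -- will
>> tell you how much is actually resident):
>> 
>>     # Read every forest file once so it lands in the OS buffer cache
>>     find /mldata -type f -exec cat {} + > /dev/null
>>     # Optional, with vmtouch installed:
>>     vmtouch /mldata      # report how much of the data is already resident
>>     vmtouch -t /mldata   # explicitly touch every page into memory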
>> 
>> If you're testing the buffer cache in linux, it helps to know how to clear 
>> it: http://www.kernel.org/doc/Documentation/sysctl/vm.txt describes the 
>> 'drop_caches' sysctl for this.
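>> 
>> In practice that's just (as root):
>> 
>>     sync                               # flush dirty pages to disk first
>>     echo 3 > /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes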
>> 
>> Once the OS buffer cache is populated you could just let the list and tree 
>> caches populate themselves from there, and that might be best. But by 
>> definition you have enough tree cache to hold the entire database, so you 
>> could load it up with something like 'xdmp:eval("collection()")[last()]'. 
>> The xdmp:eval ensures a non-streaming context, necessary because streaming 
>> bypasses the tree caches.
>> 
>> I can't really think of a good way to pre-populate the list cache. You could 
>> cts:search with random terms, but that would be slow and there would be no 
>> guarantee of completeness. Another approach would be to build a word-query 
>> of every word in the word-lexicon, and then estimate it. That won't be 
>> complete either, since it won't exercise non-word terms. Probably it's best 
>> to ensure that the ListData is in OS buffer cache, and leave it at that.
>> 
>> -- Mike
>> 
>> On 16 Feb 2013, at 10:50, Ron Hitchens <[email protected]> wrote:
>> 
>>> 
>>> I'm trying to work out the best way to deploy a system
>>> I'm designing into the cloud on AWS.  We've been through
>>> various permutations of AWS configurations and the main
>>> thing we've learned is that there is a lot of uncertainty
>>> and unpredictability around I/O performance in AWS.
>>> 
>>> It's relatively expensive to provision guaranteed, high
>>> performance I/O.  We're testing an SSD solution at the
>>> moment, but that is ephemeral (lost if the VM shuts down)
>>> and very expensive.  That's not a deal-killer for our
>>> architecture, but makes it more complicated to deploy
>>> and strains the ops budget.
>>> 
>>> RAM, on the other hand, is relatively cheap to add to
>>> an AWS instance.  The total database size, at present, is
>>> under 20GB and will grow relatively slowly.  Provisioning
>>> an AWS instance with ~64GB of RAM is fairly cost effective,
>>> but the persistent EBS storage is sloooow.
>>> 
>>> So, I have two questions:
>>> 
>>> 1) Is there a best practice to tune MarkLogic where
>>> RAM is plentiful (twice the size of the data or more) so
>>> as to maximize caching of data?  Ideally, we'd like the
>>> whole database loaded into RAM.  This system will run as
>>> a read-only replica of a master database located elsewhere.
>>> The goal is to maximize query performance, but updates of
>>> relatively low frequency will be coming in from the master.
>>> 
>>> The client is a Windows shop, but Linux is an approved
>>> solution if need be.  Are there exploitable differences at
>>> the OS level that can improve filesystem caching?  Are there
>>> RAM disk or configuration tricks that would maximize RAM
>>> usage without affecting update persistence?
>>> 
>>> 2) Given #1 could lead to a mostly RAM-based configuration,
>>> does it make sense to go with a single high-RAM, high-CPU
>>> E+D-node that serves all requests with little or no actual I/O?
>>> Or would it be an overall win to cluster E-nodes in front of
>>> the big-RAM D-node to offload query evaluation and pay the
>>> (10 Gb) network latency penalty for inter-node comms?
>>> 
>>> We do have the option of deploying multiple standalone
>>> big-RAM E+D-nodes, each of which is a full replica of the data
>>> from the master.  This would basically give us the equivalent
>>> of failover redundancy, but at the load balancer level rather
>>> than within the cluster.  This would also let us disperse
>>> them across AZs and regions without worrying about split-brain
>>> cluster issues.
>>> 
>>> Thoughts?  Recommendations?
>>> 
>>> ---
>>> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>>>  +44 7879 358 212 (voice)          http://www.ronsoft.com
>>>  +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
>>> "No amount of belief establishes any fact." -Unknown
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 
> ---
> Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
>     +44 7879 358 212 (voice)          http://www.ronsoft.com
>     +1 707 924 3878 (fax)              Bit Twiddling At Its Finest
> "No amount of belief establishes any fact." -Unknown
> 
> 
> 
> 
> 

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
