While I'm at it, another question. Assuming it works well to set up a single E+D big-RAM system that runs super fast because it does virtually no I/O, where is the upper limit to scalability as concurrent requests go up?
Such a system in AWS would probably have 8-16 cores. Even with nearly frictionless data access, there will be an upper bound to how many queries can be evaluated in a given unit of time. How do we determine the crossover point where it's better to add E-nodes to spread the CPU load and let the big-RAM guy focus on data access? Would it make sense to go with n*E + 1*D from the start, where n can be dialed up and down easily? Or go with one monolithic E+D and just replicate it as the load goes up? The usage profile is likely to have peaks and valleys at fairly predictable times during the day/week.

On Feb 16, 2013, at 8:37 PM, Michael Blakeley <[email protected]> wrote:

> Ron, I believe that if you *can* fit your application onto one host without ridiculous hardware costs, you should. There are many actual and potential optimizations that can be done for local forests, but they break down once the forests are on different hosts.
>
> On Linux vs Windows: I think you'll have more control on Linux. You can tune the OS I/O caching and swappiness, as well as dirty page management. It might also help to know that default OS I/O caching behaves quite a bit differently between the two. I've done tests where I've set the compressed-tree cache size as low as 1 MB on Linux without measurable performance impact. The OS buffer cache already does pretty much the same job, so it picks up the slack. As I understand it, this would not be true on Windows.
>
> Getting everything into RAM is an interesting problem. You can turn on "pre-map data" for the database, although that makes forest mounts quite I/O intensive. If possible, use huge pages so the OS can more easily manage all the RAM you're planning to use. Next, I think you'll want to figure out how your forests break out into mappable files, ListData, and TreeData. You'll need available RAM for the mappable files, a list cache equal to the ListData, a compressed-tree cache equal to the TreeData, and an expanded tree cache equal to about 3x TreeData. Often ListData is something like 75% of forest size, TreeData is something like 20%, and mappable files make up the rest - but your database will likely be a little different, so measure it.
>
> But with those proportions, a 10000-MiB database would call for something like a 7500-MiB list cache, a 2000-MiB compressed-tree cache, a 6000-MiB expanded-tree cache, plus 500 MiB for mapped files: a total of about 1.6x the database size. Ideally you'd have *another* 1x so that the OS buffer cache could also cache all the files. You'll also want to account for the OS, merges, and in-memory stands for those infrequent updates. I expect 32 GB would be more than enough for this example. As noted above, you could skimp on the compressed-tree cache if the OS buffer cache will do something similar, but that's a marginal savings.
>
> At some point in all this, you'll want to copy all the data into RAM. Slow I/O will impede that, especially if everything is done using random reads. One way to manage this is to use a ramdisk, but that has the disadvantage of requiring another large slice of memory: probably more than 2x to allow merge space, since you will have some updates. You could easily load up the OS buffer cache with something like 'find /mldata | xargs wc', as long as the buffer cache doesn't outsmart you. On Linux, where fadvise is available, using 'willneed' should help. This approach doesn't load the tree caches, but database reads should come from the buffer cache instead of driving more I/O.
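(A note to self here: turning your sizing rules above into quick arithmetic. The figures are just your 10000-MiB worked example, not measurements from our forests - the idea is to plug in real ListData/TreeData sizes once we have them.)

    xquery version "1.0-ml";
    (: Rough cache-sizing arithmetic per the rules of thumb above.
       Sizes are the 10000-MiB example from the thread; substitute
       measured ListData/TreeData/mappable-file sizes from real forests. :)
    let $list-data-mb := 7500   (: ~75% of forest size :)
    let $tree-data-mb := 2000   (: ~20% of forest size :)
    let $mapped-mb    := 500    (: mappable files: the remainder :)
    return
      <cache-plan>
        <list-cache-mb>{ $list-data-mb }</list-cache-mb>
        <compressed-tree-cache-mb>{ $tree-data-mb }</compressed-tree-cache-mb>
        <expanded-tree-cache-mb>{ 3 * $tree-data-mb }</expanded-tree-cache-mb>
        <ram-for-mapped-files-mb>{ $mapped-mb }</ram-for-mapped-files-mb>
        <total-mb>{ $list-data-mb + (4 * $tree-data-mb) + $mapped-mb }</total-mb>
      </cache-plan>

That comes out to 16000 MiB for the example, i.e. the ~1.6x of database size quoted above, before the extra ~1x for the OS buffer cache.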
> If you're testing the buffer cache in Linux, it helps to know how to clear it: http://www.kernel.org/doc/Documentation/sysctl/vm.txt describes the 'drop_caches' sysctl for this.
>
> Once the OS buffer cache is populated you could just let the list and tree caches populate themselves from there, and that might be best. But by definition you have enough tree cache to hold the entire database, so you could load it up with something like 'xdmp:eval("collection()")[last()]'. The xdmp:eval ensures a non-streaming context, necessary because streaming bypasses the tree caches.
>
> I can't really think of a good way to pre-populate the list cache. You could cts:search with random terms, but that would be slow and there would be no guarantee of completeness. Another approach would be to build a word-query of every word in the word lexicon, and then estimate it. That won't be complete either, since it won't exercise non-word terms. Probably it's best to ensure that the ListData is in the OS buffer cache, and leave it at that.
>
> -- Mike
>
> On 16 Feb 2013, at 10:50, Ron Hitchens <[email protected]> wrote:
>
>> I'm trying to work out the best way to deploy a system I'm designing into the cloud on AWS. We've been through various permutations of AWS configurations, and the main thing we've learned is that there is a lot of uncertainty and unpredictability around I/O performance in AWS.
>>
>> It's relatively expensive to provision guaranteed, high-performance I/O. We're testing an SSD solution at the moment, but that is ephemeral (lost if the VM shuts down) and very expensive. That's not a deal-killer for our architecture, but it makes deployment more complicated and strains the ops budget.
>>
>> RAM, on the other hand, is relatively cheap to add to an AWS instance. The total database size, at present, is under 20GB and will grow relatively slowly. Provisioning an AWS instance with ~64GB of RAM is fairly cost effective, but the persistent EBS storage is sloooow.
>>
>> So, I have two questions:
>>
>> 1) Is there a best practice for tuning MarkLogic where RAM is plentiful (twice the size of the data or more) so as to maximize caching of data? Ideally, we'd like the whole database loaded into RAM. This system will run as a read-only replica of a master database located elsewhere. The goal is to maximize query performance, but updates of relatively low frequency will be coming in from the master.
>>
>> The client is a Windows shop, but Linux is an approved solution if need be. Are there exploitable differences at the OS level that can improve filesystem caching? Are there RAM disk or configuration tricks that would maximize RAM usage without affecting update persistence?
>>
>> 2) Given that #1 could lead to a mostly RAM-based configuration, does it make sense to go with a single high-RAM, high-CPU E+D-node that serves all requests with little or no actual I/O? Or would it be an overall win to cluster E-nodes in front of the big-RAM D-node to offload query evaluation and pay the (10-Gb) network latency penalty for inter-node comms?
>>
>> We do have the option of deploying multiple standalone big-RAM E+D-nodes, each of which is a full replica of the data from the master. This would basically give us the equivalent of failover redundancy, but at the load balancer level rather than within the cluster.
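(Another note to self, on the warm-up suggestions above: a rough, untested sketch of the tree-cache and list-cache passes as one query, to run after the OS buffer cache has been primed. It assumes the database has a word lexicon enabled, and the lexicon pass is only a partial warm-up since it doesn't exercise non-word terms.)

    xquery version "1.0-ml";
    (: Untested warm-up sketch; run once after priming the OS buffer cache. :)
    (
      (: 1. Pull every fragment through the tree caches. Keeping [last()]
            outside the eval follows the one-liner above; the eval provides
            the non-streaming context, since streaming bypasses the caches. :)
      xdmp:eval("collection()")[last()],

      (: 2. Partially warm the list cache: estimate a word-query built from
            the entire word lexicon (assumes a word lexicon is enabled). :)
      xdmp:estimate(cts:search(collection(), cts:word-query(cts:words())))
    )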
>> This would also let us disperse them across AZs and regions without worrying about split-brain cluster issues.
>>
>> Thoughts? Recommendations?

---
Ron Hitchens {mailto:[email protected]}   Ronsoft Technologies
+44 7879 358 212 (voice)                 http://www.ronsoft.com
+1 707 924 3878 (fax)                    Bit Twiddling At Its Finest
"No amount of belief establishes any fact." -Unknown

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
