We usually look at sizing questions from a timing and load perspective first. How many queries per sec on average and peak, and how many inserts per sec on average and peak?
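As a rough illustration of this load-first sizing, here is a back-of-envelope sketch in Python. Only the comparison of expected against available IO bandwidth and the 1-or-2-cores-per-forest rule of thumb come from this reply. The function names, the write-amplification factor, and the sample numbers are assumptions made up for the example, not MarkLogic guidance.

```python
import math


def hosts_for_ingest(peak_inserts_per_sec: float,
                     avg_doc_bytes: float,
                     host_io_bytes_per_sec: float,
                     write_amplification: float = 1.0) -> int:
    """Estimate hosts needed so expected write IO fits the available IO.

    write_amplification is a hypothetical fudge factor for journaling,
    indexing and merges; measure it with a sample set rather than guessing.
    """
    expected_io = peak_inserts_per_sec * avg_doc_bytes * write_amplification
    return max(1, math.ceil(expected_io / host_io_bytes_per_sec))


def forests_per_host(cores: int, mixed_workload: bool) -> int:
    """Rule of thumb: 1 core per forest for mostly-query or mostly-insert
    workloads, 2 cores per forest when both happen at the same time a lot."""
    cores_per_forest = 2 if mixed_workload else 1
    return max(1, cores // cores_per_forest)


if __name__ == "__main__":
    # Made-up example numbers: 2000 inserts/s at peak, 50 KB documents,
    # 200 MB/s of write bandwidth per host, 3x write amplification.
    print(hosts_for_ingest(2000, 50_000, 200_000_000, write_amplification=3.0))
    print(forests_per_host(16, mixed_workload=True))
```

Treat the forest count as a starting point for larger forests only; tiny or rarely used forests can go well beyond it, and measured IO numbers should always win over these estimates.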
With a given sample set, you can often get estimates of read/write IO, which is one of the biggest bottlenecks in most cases, particularly for inserts. The expected IO bandwidth versus the available IO bandwidth per host typically gives an indication of how many hosts you need to reach the ingest speed you are after.

Querying, however, should be less IO bound, because ideally you run from indexes as much as possible. More forests help speed up querying because index lookups can be parallelized.

The number of forests is linked to the number of cores, as you suggest, but it is not a one-to-one relation. A rough rule of thumb is 1 or 2 cores per forest: 1 if it is mostly querying only or inserting only, 2 if both happen at the same time a lot. That is for bigger forests, though. You can probably push it a bit if the forests are tiny and/or only used for a limited part of the day. I think I currently have almost 150 forests on my 16 core laptop, 3 to 5 for each demo that I happen to have installed. That only works because I rarely use more than one demo at the same time.

In the end I think IO bandwidth is more important than the number of forests. Also keep in mind that scaling up and down is relatively easy with MarkLogic. If you start collecting performance metrics, you should get a good feel for how your system would hold up as you increase the load.

Cheers,
Geert

From: <[email protected]> on behalf of Andreas Hubmer <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Wednesday, November 29, 2017 at 1:19 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Multi-Database Architecture

Actually, it is the other way around. MarkLogic prefers multiple forests over a single forest... Don’t put too many forests on a single host though, or they will just compete for resources.
Where would you draw the line between preferring many small forests and not creating too many forests on a host? Would you use the expected forest size as an indicator (e.g. no forest < 1 GB)? Or would you try to create no more forests per host than the number of CPU cores divided by 2?

Thanks,
Andreas

2017-11-28 12:38 GMT+01:00 Geert Josten <[email protected]>:

Actually, it is the other way around. MarkLogic prefers multiple forests over a single forest. Each forest has its own in-memory stand, and MarkLogic prefers multiple smaller ones over one big one. The idea is that this allows parallelizing the work of resolving queries from indexes, and also pulling content from disk in parallel (particularly if multiple hosts or disks/controllers are involved). Don’t put too many forests on a single host though, or they will just compete for resources.

Also note that a forest is not the same as a database. Each database will have at least one forest, but could have many more, potentially spread out over multiple hosts. So one big database or multiple small ones could end up resulting in the same in-memory stand sizes. It all depends on how many forests each database has, and how much data is inside them.

Whether it makes most sense to use one shared db or multiple small ones is primarily a functional/business question.

I’d add, though, that I’d personally prefer built-in backup over MLCP for backups.

Cheers,
Geert

From: <[email protected]> on behalf of Andreas Hubmer <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Tuesday, November 28, 2017 at 10:59 AM
To: MarkLogic Developer Discussion <[email protected]>
Subject: Re: [MarkLogic Dev General] Multi-Database Architecture

Hi,

The clients are different services in a larger micro-service landscape.
Some of them will store small amounts of data (less than 1 GB, maybe even less than 100 MB), others large amounts.

The services with small amounts of data make me worry about efficient usage of memory and in-memory stands. If they share a database, the shared database could have larger in-memory stands (in contrast to the many small in-memory stands of the individual databases). I assume that larger in-memory stands perform much better in peak moments?! Additionally, it is easier to tune the configuration of one database than to tune the configuration of many databases.

On the other hand, we want to have an easy backup & restore process. Do you have any suggestions or experience on how this could be done in a shared database on a directory level? The backup could be done with MLCP (export, point-in-time). The restore with MLCP would be a two-step process: remove all content from the directory, then import the backup. This is not as straightforward as the built-in backup features.

Security, SLAs and data sharing are relevant topics which I feel comfortable with.

Maybe we'll go with a mix of shared and individual databases, even though this means a more complex architecture.

Thanks,
Andreas

2017-11-23 21:18 GMT+01:00 David Gorbet <[email protected]>:

If these are completely separate use cases, please consider completely separate clusters. You can use virtualization to make the hardware work out.

On Nov 23, 2017, at 12:04 PM, Geert Josten <[email protected]> wrote:

Hi Andreas,

I think each forest has its own in-memory stand, so if each client has a reasonable amount of data, you’ll need several forests per client anyhow. One or multiple databases wouldn’t matter much in that case.

I wouldn’t worry too much about in-memory stands though. Memory is much faster than disk, so worth using. And you’ll want spare resources anyhow to handle peak moments, so not fully utilizing resources all the time isn’t necessarily bad.
An average use of 30% of CPU and memory is pretty typical, I’d say.

I would suggest looking at it more from a business or functional perspective. For instance:

* Do you need to guarantee clients won’t be able to see each other’s data? That would be a strong argument for keeping things separate, without doubt.
* Could different clients have different SLA terms? Another vote for keeping things separate.
* What if one client wants to step out, and you need to purge its data? Dead simple with separate databases.
* Is there any chance one of the clients would like to run it on-site, rather than hosted?
* Or the opposite: would there be any need to mix datasets from different clients? Any kind of sharing for instance, even if just of statistics, or some anonymous cross-validation?

And you can probably think of many more yourself.

Cheers,
Geert

From: <[email protected]> on behalf of Andreas Hubmer <[email protected]>
Reply-To: MarkLogic Developer Discussion <[email protected]>
Date: Thursday, November 23, 2017 at 4:53 PM
To: MarkLogic Developer Discussion <[email protected]>
Subject: [MarkLogic Dev General] Multi-Database Architecture

Hi,

I am planning the architecture of an application with dozens of individual clients. I am thinking of using either one database for all data or a separate database per client.

The main pros and cons for me are efficient memory usage and the possibility of individual backup & restore. I tend to prefer the first and accept more complicated restore scenarios.

These are my considerations.

one-db:
* each client would use a different base directory (security: uri-privileges)
* 1 in-memory-stand -> more efficient memory usage. Do you agree that this is relevant?
* individual backup & restore of data of one client => complicated (MLCP?)
many-dbs (one db per client):
* many in-memory-stands -> less efficient memory usage / more, smaller stands / more merging. Do you agree?
* built-in backup & restore of data of one client is possible
* very flexible configuration (individual indexes, ...)
* deployment more complex

For configuration we will use Roxy.

Thanks,
Andreas

--
Andreas Hubmer
Senior IT Consultant
EBCONT enterprise technologies GmbH
Millennium Tower
Handelskai 94-96
A-1200 Vienna
Mobile: +43 664 60651861
Fax: +43 2772 512 69-9
Email: [email protected]
Web: http://www.ebcont.com
OUR TEAM IS YOUR SUCCESS
UID-Nr. ATU68135644
HG St.Pölten - FN 399978 d

CONFIDENTIALITY/DISCLAIMER: This email and any files transmitted with it are confidential, are legally protected before publication and are intended solely for the use of the individual or entity to whom they are addressed. If you have received this email in error, please notify the sender immediately and destroy this e-mail together with all attachments. The content of this e-mail may not be disseminated, published, copied or stored on third parties. We assume no liability for any damage that may result from this e-mail or its annexes.
_______________________________________________
General mailing list
[email protected]
Manage your subscription at: http://developer.marklogic.com/mailman/listinfo/general
