A single database will be the most efficient configuration for storage and
query. One database per customer will be the most flexible operationally - or
better yet one VM per customer. You can land somewhere in the middle by
combining some customers and separating others.

Customer-specific configuration doesn't sound too challenging to me. I'd use a
well-known URI prefix, so for example a dictionary might be
'/customer-config/{$CUSTOMER}/dictionary.xml'. You always know the customer id,
so any customer's settings and metadata are just a doc($uri) away. I'd likely
do something similar with '/customer-data/{$CUSTOMER}/' for the customer's
data. Or '/customer/{$CUSTOMER}/config/' and '/customer/{$CUSTOMER}/data/'
would also work, and should give you an extra layer of customer isolation.
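
A minimal sketch of that URI convention, written in Python purely for illustration. The two prefixes come from the paragraph above; the helper names (`config_uri`, `data_uri`) and the percent-encoding of the customer id are assumptions, not anything MarkLogic provides.

```python
from urllib.parse import quote

# Prefixes follow the convention described above; the helper names are
# hypothetical and only illustrate the idea.
CONFIG_PREFIX = "/customer-config/"
DATA_PREFIX = "/customer-data/"


def _customer_root(prefix: str, customer_id: str) -> str:
    """Return the directory URI that holds everything for one customer."""
    if not customer_id:
        raise ValueError("customer id must not be empty")
    # Percent-encode the id (including any '/') so that it always stays a
    # single path segment under the prefix. This is an assumed safeguard.
    return prefix + quote(customer_id, safe="") + "/"


def config_uri(customer_id: str, name: str) -> str:
    """URI of a configuration document, e.g. a customer's dictionary."""
    return _customer_root(CONFIG_PREFIX, customer_id) + name


def data_uri(customer_id: str, path: str) -> str:
    """URI of a content document under the customer's data directory."""
    return _customer_root(DATA_PREFIX, customer_id) + path.lstrip("/")
```

With this, `config_uri("acme", "dictionary.xml")` yields `/customer-config/acme/dictionary.xml`, which is the string you would hand to doc($uri) on the server side.
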

Having a large number of app servers should be fine. My laptop has over twenty
right now. Each one listens on a TCP/IP port, and you have tens of thousands
available. As you get into hundreds the admin UI may creak a bit, but that's
fixable.

Forest configuration is about as automatic as it can be, I think. At the most
basic level a forest is a filesystem directory, something like
/var/opt/MarkLogic/Forests/Security or /mnt/sdc/mldata/Forests/forest-1. You
specify the data directory for each forest, or it defaults to
/var/opt/MarkLogic. If the database is small, you can create one forest pretty
much anywhere and it will be fine. But with large databases you'll usually want
to spread work across multiple storage paths, and you'll also want multiple
forests for greater update concurrency. So this is basically a question of
resource allocation, and forests are the model for allocating physical storage
paths to a database.

My usual recommendation is to configure one forest per two CPU cores per host.
So a 24-core host gets 12 forests. I prefer to configure a separate filesystem
and block device for each forest, or no more than 2-4 forests per filesystem
and block device. I do that because a busy forest can develop long I/O queues,
which are specific to a block device. As a practical matter I find that
multiple I/O queues perform better.
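
The rule of thumb above can be worked through as a small Python sketch. The function name `forest_plan` is hypothetical, and rounding down for odd core counts and the minimum of one forest are assumptions; the one-forest-per-two-cores ratio and the cap of four forests per filesystem come from the paragraph above.

```python
import math


def forest_plan(cores_per_host: int, forests_per_filesystem: int = 1):
    """Return (forests, filesystems) for one host.

    forests: one per two CPU cores (rounded down, at least one; the
             rounding and the minimum are assumptions).
    filesystems: enough separate filesystems / block devices that none
             holds more than forests_per_filesystem forests.
    """
    if cores_per_host < 1:
        raise ValueError("cores_per_host must be at least 1")
    if not 1 <= forests_per_filesystem <= 4:
        raise ValueError("use between 1 and 4 forests per filesystem")
    forests = max(1, cores_per_host // 2)
    filesystems = math.ceil(forests / forests_per_filesystem)
    return forests, filesystems
```

For the 24-core host mentioned above, `forest_plan(24)` gives 12 forests on 12 filesystems, and `forest_plan(24, 4)` gives the same 12 forests packed onto 3 filesystems.
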

-- Mike

On 7 Jul 2014, at 12:59, Casey Jordan wrote:
> Thanks again Mike for the detailed thoughts.
>
> So are you suggesting that big clients get their own database and smaller
> clients are shared? One big concern for me is the development overhead
> around a shared database: there are just a lot more things to consider,
> such as dictionaries, language settings, and other configuration. The
> separate database model keeps this very clean. On the
> other hand it brings up issues like the need to create App Servers for each
> database, which means probably 2-3 App servers for each client. Could this be
> a major issue?
>
> Also, regarding your RAID analogy, I guess I don't understand why we create
> multiple forests at all if the db is going to distribute data across them
> automatically. Given how I understand them, why wouldn't the database manage
> all of this internally?
>
>
>
> On Mon, Jul 7, 2014 at 3:27 PM, Michael Blakeley wrote:
> That suggests a raw tree size of about 5-GB for a large customer. With a high
> level of text indexing it might approach 20-GB or even 40-GB. That's a medium
> size for a forest. It's best to limit them to about 200-GB, but short of that
> a larger forest is more efficient than a smaller one. Since those are your
> larger customers, that suggests you could combine quite a few smaller
> customers. To me this points to a shared database.
>
> Forest storage is basically schemaless. Simply ingesting XML doesn't validate
> it against a schema. You do that explicitly using a validate { ... }
> expression. It's possible to make that happen using a trigger, if you want
> automatic validation. But usually it's better to accept documents even when
> they don't validate, so that fixing them is a database operation.
>
> Your next question may be: should I map specific customers to specific
> forests? Usually no. Usually it's better to let the database spread documents
> around. Think of the forests as disks in a RAID volume, rather than
> sub-databases.
>
> -- Mike
>
> On 7 Jul 2014, at 10:58, Casey Jordan wrote:
>
> > Thanks, I figured that there would be more resources that were not shared
> > when having multiple dbs. That being said, I am not sure it would be a big
> > impact in my case. I would say that a big client might have 500k documents
> > that are around 10 KB each.
> >
> > Also, another consideration is that each client needs to have separate
> > schemas for their content. So this might force me into the multi-db design,
> > unless I made the default content store forest schemaless.
> >
> > Is it even possible to have a schemaless forest?
> >
> >
> > On Mon, Jul 7, 2014 at 1:37 PM, Gene Thomas wrote:
> > I think the overall performance would be best with your content in separate
> > databases.
> >
> > Gene
> >
> >
> > On Monday, July 7, 2014 10:33 AM, Casey Jordan wrote:
> >
> >
> > Thanks guys that is really helpful information.
> >
> > Are there any significant performance or resource tradeoffs when choosing
> > between putting everything in one big database vs splitting it into one for
> > each "client"? Personally I like the idea of keeping everything as separate
> > as possible, but if that comes with a major tradeoff, it would be good to
> > know.
> >
> >
> > On Mon, Jul 7, 2014 at 1:28 PM, Justin Makeig wrote:
> > Casey,
> > There are two ways in MarkLogic 7 to query a specific database. The first is
> > to create a separate app server (HTTP or XDBC) for each database. An app
> > server has a default database that you set in its configuration, and each
> > query or update evaluated by that app server runs against that database.
> > Many app servers can point to one database, but an app server can only be
> > associated with one database. The second, lower-level means is to use
> > xdmp:eval <http://docs.marklogic.com/xdmp:eval?q=xdmp:eval> or xdmp:invoke.
> > These allow you to specify a database at runtime and evaluate specific code
> > against it. I wouldn't recommend this as a general approach, though. It
> > will make your code less readable and, in certain scenarios, will prevent
> > MarkLogic from applying some of the performance optimizations it does under
> > the covers.
> >
> > Another approach might be to create protected collections for each "tenant"
> > within the same database. With MarkLogic's role-based security, you can be
> > assured that you can completely restrict viewing and editing to very
> > specific roles. You can take a similar approach to running privileged code
> > with amps. Take a look at the Security Guide for more details
> > <http://docs.marklogic.com/guide/admin/security#chapter>.
> >
> > Justin
> >
> >
> >
> > Justin Makeig
> > Director, Product Management
> > MarkLogic Corporation
> > www.marklogic.com
> >
> >
> >
> > On Jul 7, 2014, at 10:14 AM, Casey Jordan wrote:
> >
> >> Hi all,
> >>
> >> I am checking out MarkLogic for the first time and I was wondering whether
> >> there is any information on designing a cluster for multi-tenancy.
> >>
> >> I assumed that I could create a separate database for each "client" that
> >> would be using the application, and then segment data that way. However
> >> right away it became a little unclear to me as to how I query a specific
> >> database (couldn't find an example of this in the docs), or manage users,
> >> triggers, schemas etc for a specific database.
> >>
> >> I know this is a fairly general question, but any advice would be helpful.
> >>
> >> Thanks
> >>
> >> --
> >> --
> >> Casey Jordan
> >> easyDITA a product of Jorsek LLC
> >> "CaseyDJordan" on LinkedIn, Twitter & Facebook
> >> (585) 348 7399
> >> easydita.com
> >>
> >>
> >> This message is intended only for the use of the Addressee(s) and may
> >> contain information that is privileged, confidential, and/or exempt from
> >> disclosure under applicable law. If you are not the intended recipient,
> >> please be advised that any disclosure copying, distribution, or use of
> >> the information contained herein is prohibited. If you have received
> >> this communication in error, please destroy all copies of the message,
> >> whether in electronic or hard copy format, as well as attachments, and
> >> immediately contact the sender by replying to this e-mail or by phone.
> >> Thank you.
> >> _______________________________________________
> >> General mailing list
> >> [email protected]
> >> http://developer.marklogic.com/mailman/listinfo/general