I just ran across this in the docs under Collection Performance, so I am adding it here as part of the discussion.

"A practical guideline is that a document with fragments averaging 50K in size should not belong to more than 100 collections. This should keep the average fragment size increase to less than 10%."

On Wed, Dec 11, 2013 at 4:57 PM, Geert Josten <[email protected]> wrote:

> Anyone from LDS listening to this thread? I recall they were doing lots of user and role creation on the fly, for many, many users. And I’m sure markmail will be able to pull out several threads on similar topics, including some about the LDS case.
>
> Cheers,
> Geert
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* David Lee
> *Sent:* Wednesday, December 11, 2013 18:58
> *To:* MarkLogic Developer Discussion
> *Subject:* Re: [MarkLogic Dev General] Document Level Authorization (Roles and Users)
>
> Great bits of info.
>
> I will add a tidbit more ... but would love the conversation to continue, as I think this is an increasingly common problem/question as ML starts to get put to use on larger user bases.
>
> 1) In ML7 semantics is *built in* at the core level. It performs *vastly faster* than the semantic libraries from V6.
>
> 2) During query I cannot imagine anything outperforming the built-in role/security module. It's done deeply in the code, it parallelizes across nodes, and it doesn't require 2 passes (to get the list of URIs, then to query against them), so in theory it should scale to any size. I have not heard of any cases where the number of roles was an issue, except in the Admin UI (port 8001), where the GUI is not optimized for huge numbers of roles.
>
> 3) The built-in security is *rock solid* ... you can't circumvent it; it literally hides the existence of documents that you have no rights to. It has passed numerous security audits, and I doubt any but the most dedicated could equal the security aspects in user code.
>
> BUT ...
>
> What I don't know ... is the performance of change.
> How expensive is it to:
>
> A) Add a new set of N documents to a role (actually, add a new role to a set of N documents)? It requires rewriting every document.
>
> B) Create a new collection and then add all the relevant documents to it? (It requires rewriting every document.)
>
> C) Add a new role if you have thousands or millions? Is it linear, or does it take increasingly long to maintain large numbers of roles?
>
> I don't know the answers to these ... but they are worth considering.
>
> So far, to my mind, most arguments would favor using the built-in role/user access for this purpose ... it's rock solid for security, and for query it's an *obvious* performance gain.
>
> BUT ... suppose A, B, C are "expensive" AND they are frequently executed; at some point it might make more sense to handle user roles at the app level ... depending on how often you change roles vs. how often you query documents.
>
> -----------------------------------------------------------------------------
> David Lee
> Lead Engineer
> MarkLogic Corporation
> [email protected]
> Phone: +1 812-482-5224
> Cell: +1 812-630-7622
> www.marklogic.com
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* Harry B.
> *Sent:* Wednesday, December 11, 2013 12:37 PM
> *To:* MarkLogic Developer Discussion
> *Subject:* Re: [MarkLogic Dev General] Document Level Authorization (Roles and Users)
>
> I have a lot more to add to what you've brought up, David, but I'm short on time at the moment. I can add a few things quickly and perhaps put together more detail as a blog post or two later...
>
> First of all, scalability of user roles is really not a huge issue. MarkLogic stores all the data, roles, etc. as XML, so in theory it's as massively scalable as needed. That said, I've only implemented/tested this approach with up to about 2000 users.
> I have verified it across a few million documents for those couple thousand users, though.
>
> Secondly, the semantic approach is something else I have done when role-based options weren't available in the project design. I do think this is a very strong application logic-based approach, and in general it is very performant for what it is, though good query construction is essential for it to scale. It is very fast for creating large numbers of shares, since it's an insert operation (or for revoking large numbers of shares). For a recent project, I initially went with the semantic approach that I had used on another project, but this time I did a side-by-side comparison. When there was a user with tens of thousands of shared documents (documents they were entitled to at least read), the query time was somewhere around 2 seconds without any tuning or tweaking. The same query using roles took 0.02 seconds. I basically figured on it being two orders of magnitude slower. That was a quick and dirty analysis, however, and I don't know if using ML7's native support instead of a home-grown version of Michael Blakeley's semantic library, or other tuning/optimization, might have brought that down. It was enough at the time to convince me to push for leveraging the ML security model.
>
> The main reason to use the built-in roles to control document access is that ML has to do that query/processing no matter what. Collections, semantics, and even adding data or properties to a document all "work", so it's a matter of balancing your trade-offs.
>
> More in a while...
>
> On Wed, Dec 11, 2013 at 9:06 AM, David Lee <[email protected]> wrote:
>
> A first and foremost question to ask is: are you asking for server-level security on this sharing, or are you happy with application-level?
> If you want or need server-level security (that is, if someone were to access the ML server directly using their credentials and start issuing queries, could they gain access to docs they shouldn't?), then the only way I know of to do this right is using the server-supplied role-based security. It is *hard code baked in* ... you simply cannot break it ... you can't even tell the existence of documents to which you do not have access. It's also extremely efficient on query because it's done very deeply in the server. But it comes at the "price" of using the built-in security measures, mainly the price of having to touch every document that has its role changed or its set of collections changed.
>
> This is not a bad thing. It's a great thing, but it does limit your choices and there is a performance hit. (How much? As with most things, "it depends".)
>
> If, on the other hand, you physically restrict access to the ML server to your app only, and you are confident in *your code* ... then there are other options.
>
> One I have been thinking about lately is the use of ML7 semantics features. This is a very lightweight way of storing lists of things; it could, for example, store associations between users and the documents they can view. Similar to storing this data in an XML file(s) ... but much faster for some use cases because of the way it's indexed, and you don't have to change the target documents to change the list of who can see them - unlike changing what collections or roles a document has. It does require doing a 2-phase query though: the first query lists the set of documents a user is allowed to see, then a second query takes that list as a constraint on a search.
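
As a rough illustration of the 2-phase query David describes, here is a minimal Python sketch. It is not MarkLogic code and uses no MarkLogic API: the dict `DOCUMENTS`, the set `SHARES`, and the functions `visible_uris`, `search`, and `share` are all hypothetical stand-ins, with the share list playing the part of the user/document associations kept outside the documents.

```python
"""Two-phase, application-level access check (illustrative only).

Nothing here is a MarkLogic API: the in-memory dict and set stand in for
the documents and for the user/document associations (e.g. triples).
"""

# Hypothetical document store: URI -> text content.
DOCUMENTS = {
    "/docs/a.xml": "quarterly sales report",
    "/docs/b.xml": "sales forecast draft",
    "/docs/c.xml": "engineering roadmap",
}

# Hypothetical share list kept apart from the documents themselves:
# (user, document URI) pairs.
SHARES = {
    ("alice", "/docs/a.xml"),
    ("alice", "/docs/c.xml"),
    ("bob", "/docs/b.xml"),
}


def visible_uris(user):
    """Phase 1: list the URIs this user is allowed to see."""
    return {uri for (owner, uri) in SHARES if owner == user}


def search(user, term):
    """Phase 2: run the search constrained to the phase-1 URI list."""
    allowed = visible_uris(user)
    return sorted(uri for uri in allowed if term in DOCUMENTS.get(uri, ""))


def share(user, uri):
    """Granting access only inserts a pair; the document is not rewritten."""
    SHARES.add((user, uri))
```

With this toy data, `search("alice", "sales")` returns only `/docs/a.xml`; after `share("alice", "/docs/b.xml")` the same search also returns `/docs/b.xml`, and no document had to be touched. The trade-off is the one raised in this thread: nothing stops a caller from skipping phase 1, so this only holds if the application is the sole path to the server.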
> I
>
> *From:* [email protected] [mailto:[email protected]] *On Behalf Of* Timothy W. Cook
> *Sent:* Wednesday, December 11, 2013 9:44 AM
> *To:* MarkLogic Developer Discussion
> *Subject:* Re: [MarkLogic Dev General] Document Level Authorization (Roles and Users)
>
> On Wed, Dec 11, 2013 at 11:41 AM, David Lee <[email protected]> wrote:
>
> Harry, how many users have you tried with this scheme? I am myself considering something for a demo app but not sure if it scales to thousands or hundreds of thousands or millions of users.
>
> This is my concern also. I need to scale to millions of users. However, each user will likely have fewer than one hundred other users to share documents with.
>
> There is also the issue that if you want to share a large set of documents with a new user (say 10,000 docs), then those 10,000 docs need to be "touched" (e.g. read and written); this could be a heavy operation.
>
> This is a scalability issue I would like to see if someone has experience with. I could easily have a user with 10,000 or more documents. What is the performance like when a new share is created across all of them?
>
> The alternative, which is not as elegant but might perform better, is to keep access lists as data (say in an XML file or files) and handle the access control at the user level. You are right, this is not as clean nor proven as using the system-level access control, but it might be
> * faster
> * easier
>
> This seems to be a brittle approach. Though it may be the best?
>
> Another option might be to store the access list of a document in document properties.
> You still have to touch the same number of files, but potentially with smaller changes (assuming the access list is smaller than the document), and you can do property-based searches combined with document searches, so no "joining" is required.
>
> This approach also crossed my mind because, in relative terms, my access list will be small.
>
> I think this would make a great paper or blog:
>
> "How to handle access control of large numbers of users and documents"
>
> Good idea. Now we just need to do the research. :-)
>
> One thing I am not certain of yet: what are the security and performance implications of using keywords in a document and then, through a query, providing visibility (to the UI) to only some of the documents? IOW: a user might have read access to documents in a collection, but without knowing that they exist and without any access to the collection except via the UI. Security through obscurity kind of rings out that idea though. Thoughts?
>
> --Tim
>
> --
> MLHIM VIP Signup: http://goo.gl/22B0U
> ============================================
> Timothy Cook, MSc +55 21 94711995
> MLHIM http://www.mlhim.org
> Like Us on FB: https://www.facebook.com/mlhim2
> Circle us on G+: http://goo.gl/44EV5
> Google Scholar: http://goo.gl/MMZ1o
> LinkedIn Profile: http://www.linkedin.com/in/timothywaynecook
>
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
