I’m also in favor of dropping Scenario 3. One topic we may have discussed in the past but I wanted to close out here: in the relational database world it’s not uncommon to use materialized views as an access control mechanism to selectively expose contents of a table to clients who cannot access the table directly. Does the current thinking on _access for views support that use case? Can we build a view using a set of roles inherited from the user who created the design doc, but then turn around and set the _access on the view itself to a less-restrictive set?
On the _revs_diff topic: I’m not all that concerned about users trying to guess revision IDs that exist on the server and then reverse-engineering the contents of the existing revisions. Maybe I ought to be. On a somewhat related note, I have had conversations with folks who are keen to adopt these sorts of fine-grained access control systems, and they said they actually prefer to have a 403 Forbidden response list the set of privileges that would be sufficient to access the resource. I found this surprising, but I guess it comes down to a user needing to figure out what kind of security exception to apply for in order to make progress with some data analysis. I think this is a topic on which we could make a fairly late-binding decision, or even have it as a configurable option.

I could definitely see the base Scenario 1 (single _access labels) landing ahead of the more-complex sharing models.

I haven’t had a chance to take a deep look at the code but the design seems good and thoughtful, and I definitely like the focus on the use cases.

Adam

> On Mar 14, 2019, at 11:21 AM, Jan Lehnardt <j...@apache.org> wrote:
>
> My replies now inline.
>
>> On 14. Mar 2019, at 16:13, Jan Lehnardt <j...@apache.org> wrote:
>>
>> I received some notes privately from Gregor Martynus, which I’m reproducing
>> here in email thread form. This email is all Gregor’s notes, my next email
>> is my replies to them.
>>
>>> On 10. Mar 2019, at 15:51, Jan Lehnardt <j...@apache.org> wrote:
>>>
>>> Hey all,
>>>
>>> after mulling this over some more, I’d like to tackle the detailed API and
>>> behaviour for this. Especially how _access works in conjunction with
>>> existing access control features.
>>>
>>> My guiding principles so far are:
>>>
>>> 1. Make the API intuitive: things should work the way they look like they
>>> should work.
>>> 2. The default should never be that a resource is accidentally left
>>> accessible to the public.
>>> 3. This should work as a natural extension to the existing security
>>> features*.
>>>
>>> * I’d be up for reworking the whole lot, too, but that might be a better
>>> discussion for > 4.0.
>>>
>>>
>>> ## Database Creation and Default Behaviours
>>>
>>> Creating a database with _access features is, as mentioned before, done via
>>> a flag to PUT /database?access=true
>>>
>>> In a 3.0 world where this would land, we already agreed that databases
>>> should be admin-only by default (instead of world read/writeable today).
>>> This is a sensible default, but that leaves us with an _access enabled
>>> database that can’t be used by anyone but server or db admins. Not very
>>> useful.
>>>
>>> To allow arbitrary users to use the db, I suggest we use the existing
>>> _security system: i.e. if a user or a group a user belongs to is mentioned
>>> in either `admins` or `members` inside of _security, they can proceed and
>>> create documents on the db. This puts a second step burden on the
>>> application developer, but it slots cleanly into the existing security
>>> mechanisms, and doesn’t require special case handling. Alternatively, we
>>> could define that _security isn’t available in _access enabled databases,
>>> but that’s something I’d like to avoid if at all possible.
>>>
>>> In order to make it easy to specify that “everyone in _users” should be
>>> able to use the db, I suggest we add a new role `_users` that is valid
>>> inside _security, which means “everyone in /_users” (this only excludes
>>> server admins which have full access anyway).
>>>
>>> * * *
>>>
>>>
>>> ## Document Creation and Access Control
>>>
>>> Next, one of our non-admin users creates a doc. There are multiple options
>>> as to how we store the _access information.
>>>
>>> 1. Automatically translate the userCtx.name of a doc creation (not an
>>> update) into the first element of the _access array. E.g. user_a PUT
>>> /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}.
>>> This is a little bit counter-intuitive.
>>>
>>> 2. We require that a user puts "_access":["user_a"] in themselves. This is
>>> an explicit granting of access permissions on doc creation and I think is
>>> preferable.
>>
>> I prefer being explicit.
>>
>>
>>>
>>> This leaves the edge case of docs that have no _access member: so far I
>>> thought those docs are admin-only, with maybe a db-wide option to swap the
>>> default to public access, but I think given the explicitness of 2. we can
>>> do better: require _access for all new doc creations in access-enabled
>>> databases. A user cannot create a new document without an _access field
>>> that is an array that has at least one member. For public documents, we
>>> could invent a new role _public, and admin-only docs could use the existing
>>> role _admin.
>>>
>>> The one downside to this approach is that we won’t be able to replicate
>>> existing databases into an access-enabled database without modifying all
>>> documents. This might be a worthwhile trade-off, but we should make that
>>> decision consciously and document it well.
>>
>> We could also provide tooling for migrations?
>
> I’d love tooling, but we’d have to make sure we can do it correctly for a big
> number of use-cases. For the acceptance of this change, I’d make “documenting
> a migration path for db-per-user setups” a MUST have, and any code that helps
> with that a nice to have.
>
>>
>>
>>> We could allow for a special case where an _admin user can create docs that
>>> have no _access field, and those docs are treated as having only the _admin
>>> role in _access. So at least we could replicate all data in, but then
>>> require a manual step to update all docs to, say, migrate an existing
>>> db-per-user app, while not accidentally exposing any docs to folks that
>>> shouldn’t read them.
>>>
>>> For the rest of cRUD, the existing document must store one of the RUD-ing
>>> user’s name or role in its _access field.
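A minimal sketch of the rules described above, not the prototype's actual code: new docs need a non-empty _access array, docs without _access are treated as admin-only, and read/update/delete requires the user's name or one of their roles to appear in the existing doc's _access. The function names and the shape of the userCtx dict are assumptions made for illustration.

```python
from typing import Any

ADMIN_ROLE = "_admin"  # existing role named in the proposal


def principals(user_ctx: dict[str, Any]) -> set[str]:
    """Return the user's name plus all of their roles (hypothetical helper)."""
    names = set(user_ctx.get("roles", []))
    if user_ctx.get("name"):
        names.add(user_ctx["name"])
    return names


def valid_access_field(doc: dict[str, Any]) -> bool:
    """A new doc must carry an _access array with at least one member."""
    access = doc.get("_access")
    return isinstance(access, list) and len(access) > 0


def may_access(user_ctx: dict[str, Any], existing_doc: dict[str, Any]) -> bool:
    """Read/update/delete check against the doc as currently stored."""
    who = principals(user_ctx)
    if ADMIN_ROLE in who:
        return True
    # Assumption from the proposal: a doc without _access counts as admin-only.
    access = existing_doc.get("_access") or [ADMIN_ROLE]
    return any(entry in who for entry in access)


user = {"name": "user_a", "roles": ["team_x"]}
print(may_access(user, {"_access": ["user_b", "team_x"]}))
```

The check deliberately looks at the stored doc rather than the incoming one, so a user cannot grant themselves access by editing _access in the request body.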
>>>
>>> For both creations and updates, a user MUST supply at least one role they
>>> belong to or their own username.
>>>
>>> * * *
>>>
>>>
>>> ## _revs_diff
>>>
>>> /db/_revs_diff can answer the question of which revisions of a document do
>>> NOT exist on a replication target:
>>> http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
>>>
>>> This would allow users to specify ids and rev(s) for docs they don’t have
>>> access to (anymore), so the result schema should be expanded to handle id:
>>> unauthorized or somesuch, something the replicator needs to know what to do
>>> with if it encounters it (say a user got removed from the _access list
>>> in between the replicator opening _changes and requesting the doc).
>>>
>>> The _revs_diff implementation would have to be altered to send an
>>> unauthorized token for each doc the requesting userCtx has no access to. If
>>> we can re-use some of our existing indexes, or any other performance
>>> optimisation, that’d be great. I haven’t looked at that code at all, yet.
>>>
>>> An important side-effect of this is, once a user has been added to a doc’s
>>> _access list, they get access to “the full history of the doc”, even before
>>> they had access. Of course, in CouchDB this means only getting access to
>>> the rev ids, and not the content, but since they are content-addressable
>>> hashes, a user could brute-force themselves into revealing certain real
>>> values from earlier incarnations of the doc. I’d rather not track _access
>>> per document revision in perpetuity, so this is something we have to be
>>> very up-front about.
>>>
>>> * * *
>>>
>>>
>>> ## Partitioned Databases
>>>
>>> I mentioned partitioned databases in my previous mail, and I think it is
>>> something we can document that end-users can opt into, but doesn’t require
>>> any special casing on the _access proposal.
>>> That is, if users start
>>> prefixing their doc ids with a user name or id and enable both _access and
>>> partitions, then they get all the benefits of a partitioned database, and
>>> if they choose not to, they don’t, but things keep working. There are
>>> enough use-cases to warrant both behaviours.
>>>
>>> * * *
>>>
>>>
>>> ## Scenarios that _access should help with.
>>>
>>> Overall, we developed _access to allow users to stop using the db-per-user
>>> architecture, but once we have per-doc-access control, folks might start
>>> using this for all manner of things. We should be clear about which
>>> scenarios we support and which we don’t.
>>>
>>>
>>> ### Scenario 1: db-per-user
>>>
>>> In this scenario, before _access enabled databases, the only way to allow
>>> mutually untrusting users to store data in a part of CouchDB that only they
>>> (and admins) have access to was giving each user their own database.
>>>
>>> In an _access enabled database, users can
>>> CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user
>>> (aside from admins) can access those docs.
>>>
>>> This is the simplest scenario, as all we’d have to do is track the owner of
>>> a document and produce by-access-id/seq indexes based on that owner.
>>>
>>> The current prototype implementation mostly reflects this stage. Not saying
>>> this is what we should ship, but it is the easiest to implement and explain.
>>>
>>> Aside, I might be able to be persuaded to ship this as a 2.x feature, to
>>> help those folks who don’t need anything else.
>>>
>>>
>>> ### Scenario 2: db-per-user + Sharing
>>
>> One scenario we should address is how stopping to share would work when
>> documents are continuously replicated, e.g. to a client for offline usage.
>> My understanding is that the person whose access to documents got
>> revoked does not get a _changes update telling them that their access got
>> removed; it would be up to the application developer to implement some kind
>> of "notification" meta documents. Unless you have a better idea?
>
> Since we now have a purge API as well, we could treat an un-share as a purge
> for clients, and they can decide what to do with it.
>
> Alternatively, we need to make breaking changes to the _changes feed, maybe
> we can hide that behind an opt-in flag, like “/db/_changes?access=true”, and
> then we can send new rows like:
>
> {seq: XYZ, id: abc, rev:4-YYY, _revoked: true} or somesuch.
>
>
>>
>>>
>>> The second we allow per doc auth, users will want to share those docs with
>>> other users. That’s why we initially suggested the _access field be an
>>> array, so other users and groups can be specified to have access. There are
>>> multiple scenarios in this one alone:
>>>
>>> #### 2.1: The Todo List
>>>
>>> In this scenario, a user has a reasonable amount of “personal data” that
>>> they want to selectively share with one or more other users.
>>>
>>> #### 2.2: The Chat/Forum/Newsgroup
>>>
>>> In this scenario, a user wants to share any number of documents with a
>>> reasonable number of groups. However, since we need to limit the number of
>>> groups a user belongs to (currently 10, see below for details), this might
>>> actually not be a great solution. Or folks couldn’t be in more than 10 chat
>>> groups at a time.
>>>
>>> #### 2.3: The Corporate Hierarchy
>>>
>>> In this scenario, users want to share any number of docs with a reasonable
>>> number of groups in a top-down/bottom-up fashion. Think CEO shares with
>>> executives, execs share with divisions, divisions report up to their one
>>> executive, etc.
>>>
>>>
>>> ### 3: Multiple Apps
>>>
>>> The preceding scenarios all assume that a single application is responsible
>>> for everything.
>>> However, once we allow mutually distrusting users into a
>>> single database *and* make each per-user slice work (almost) like a full
>>> standalone CouchDB database, what would stop users from using this for a
>>> multi-homing feature, where different applications are used for each user
>>> in the same database?
>>>
>>> I’ll be referring to these scenarios down the line.
>>>
>>> * * *
>>>
>>>
>>> ## Design Docs
>>>
>>> ### Admin
>>>
>>> One of the downsides of db-per-user is managing design docs in the face of
>>> a changing application, that is, how to distribute new design docs across
>>> 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all
>>> scenarios above but scenario 3., we could simplify this significantly. Say
>>> an admin creates a design doc, and gives all users in the db access to this
>>> design doc (this could be with the _users role, or yet another new role
>>> _members, if we need it), requesting the result of a view defined in that
>>> design doc will produce an index that is powered by the requesting user’s
>>> by-access-seq index section(s).
>>>
>>> N.B., this would require us to change a fundamental assumption when doing
>>> the association between a design doc’s definition and index: normally,
>>> there is only the `views` member that is hashed and that hash is used as
>>> the index’s filename. Because there is only by-seq to power a view, that
>>> all works. But now that we have an arbitrary set of sections on
>>> by-access-seq, any view index built will have to take a user’s name and
>>> roles into account. When a user leaves a group, or gains a group, all
>>> indexes for that user will no longer be valid and need rebuilding.
>>>
>>>
>>> ### User
>>>
>>> In any of the scenarios above, but especially 3., there could be legitimate
>>> per-user design docs, so how should those be treated in an _access enabled
>>> database?
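To illustrate the N.B. in the Admin section above: if the index name is derived from a hash of the `views` member alone, every user would share one index, so the hash input has to include the user's name and roles as well. The sketch below is only an illustration of that idea. It is not how CouchDB computes view signatures, and the function name, the JSON encoding, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import json


def view_signature(views: dict, user_name: str, roles: list[str]) -> str:
    """Hash the view definitions together with the requesting user's identity.

    Roles are de-duplicated and sorted so that role order does not change
    the signature; any change in group membership does.
    """
    payload = json.dumps(
        {"views": views, "name": user_name, "roles": sorted(set(roles))},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


views = {"by_title": {"map": "function (doc) { emit(doc.title, null); }"}}
before = view_signature(views, "user_a", ["team_x", "team_y"])
after = view_signature(views, "user_a", ["team_x"])  # user_a left team_y
print(before != after)
```

Under this scheme a membership change yields a new signature, which matches the point that all of that user's indexes become invalid and need rebuilding.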
>>>
>>> The significant fields in a design doc are `views`, `validate_doc_update`
>>> and `filters` (I’ll skip over the deprecated _show, _list, and _update).
>>>
>>> The easiest to handle is `filters`: if a user specifies a filter for a
>>> _changes request or replication that lives in a design doc they don’t have
>>> access to, they get an error, similar to if they specify a non-existent
>>> design doc, just with `unauthorized` instead of `not_found`.
>>>
>>> Next, `views` is also not very hard to imagine working: just like globally
>>> defined views for that db, the index is built for each user based on the
>>> user’s name and roles.
>>>
>>> More troubling are `validate_doc_update` functions: One, they are already
>>> troubling in that they slow down any document updates. Two, if we now
>>> import an existing db-per-user scenario where each user has their own
>>> design docs,
>>
>> I can’t think of a db-per-user scenario where each user DB would have a
>> different validate_doc_update method? It would be the same method with
>> access to the user context, the DB’s security setting and the document, so it
>> would act differently for different users, but using the same code.
>
> They wouldn’t be different, but if we were to replicate 1000 db-per-user
> design docs into a single database, as per today’s semantics, we’d have to
> run 1000 VDUs on each doc update.
>
>>
>>> how should we apply validate_doc_update functions? 10s of 1000s of VDUs are
>>> impractical to apply on each doc update, let alone just the management of
>>> VDUs that are active on a database. One option would be to ignore VDUs if
>>> they are not defined globally (say with a _members role). But especially in
>>> scenario 3. this becomes problematic, but even without that specific
>>> scenario, this violates the no surprises best practice.
>>>
>>> We could say:
>>>
>>> a) we don’t support scenario 3.
>>
>> +1, I think it would make our lives easier in general if we don’t recommend
>> sharing the same CouchDB for multiple apps. At least I don’t see a reason
>> to do that at this point.
>
> I think I like this best, too, but I’d like to hear from others as well.
>
>
> Best
> Jan
> —
>>
>>> b) we find a complicated but efficient way to apply only those VDUs that
>>> are defined in design docs the writing user has access to plus any global
>>> ones (this would be neat but rather complicated and potentially still
>>> impractical from a performance perspective for N users).
>>> c) we could store all per-user design docs, but ignore them completely,
>>> VDUs, views and filters.
>>>
>>> I think I currently fall on the side of not supporting scenario 3. and
>>> asking folks who migrate db-per-user to de-duplicate design docs and keep
>>> them per-app. I believe that is a good trade-off between the most common
>>> scenarios for db-per-user while keeping the implementation manageable.
>>> Globally accessible design docs would show up in a user’s changes feed and
>>> would replicate down to say a PouchDB application which might be the
>>> exclusive user of those design docs.
>>>
>>> In practice this would mean, a document that has an _id that starts with
>>> _design/ will have to be produced by a database admin. Luckily, that’s
>>> already the case. We should just make sure that folks don’t give db-admin
>>> access to all users habitually.
>>>
>>>
>>> ## Read and Write Access
>>>
>>> Speaking of validate_doc_update, it is used for two things: checking
>>> document schema and doc update authorisation.
>>>
>>> Once we allow access to a document with an _access field, we need to decide
>>> what kind of access this gives to a doc: read-only or read-write (I’m not
>>> considering write-only because for anything but doc creations this is not
>>> useful as you need access to the current _rev).
>>>
>>> However, when we look at implementing an application on top of our existing
>>> API, it is already weird that read access can be controlled globally (or
>>> with _access on a per doc level), but write access requires writing
>>> JavaScript code. I think it would be reasonable for users to expect
>>> per-doc read/write permission granting.
>>
>> Yes!
>>
>>>
>>> So we could have all of the above, but with two extra fields: _access_read
>>> and _access_write, or _access: {read: [], write: []}
>>
>> I prefer this API for its compactness, thinking about offline
>> synchronization. The smaller the docs, the better.
>>
>> Best
>> “Gregor”
>> —
>>
>>
>>> or we overload user and group names: _access: [user_a:read, user_b:write]
>>> (or any permutation thereof). Overloading can cause trouble with naturally
>>> occurring characters in group names.
>>>
>>> The former seems more explicit, but from an API perspective that’s a little
>>> more awkward: remember that we currently have an arbitrary limit of 10
>>> members in a user’s role array, to avoid excessive fan out on
>>> cluster-internal operations. Partitioned dbs could get away with more, more
>>> easily however. If we allow the specification of access control in two
>>> lists, and one of the lists implies membership in the other, we have a
>>> total limit of 10 members across both arrays. Or we limit 5 + 5, but that
>>> seems excessive, while 10 total seems weird, but doable. Anyway, good
>>> bikeshed.
>>>
>>>
>>> * * *
>>>
>>>
>>> So far. I think all of the problems outlined are solvable, if with a clear
>>> definition of what use-cases we do not support with access. If you have
>>> more scenarios than the ones I outlined, please add them and we can see if
>>> they cause any additional trouble.
>>>
>>> Thanks for reading this far and I’m looking forward to your feedback.
>>>
>>>
>>> Best,
>>> Jan “_access” Lehnardt
>>> —
>>>
>>>
>>>
>>>
>>>> On 17. Feb 2019, at 15:25, Jan Lehnardt <j...@apache.org> wrote:
>>>>
>>>> Hi Everyone,
>>>>
>>>> I’m happy to share my work in progress attempt to implement the per-doc
>>>> access control feature we discussed a good while ago:
>>>>
>>>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>>>>
>>>> You can check out my branch here:
>>>>
>>>> https://github.com/apache/couchdb/compare/access?expand=1
>>>>
>>>> It is very much work in progress, but it is far enough along to warrant
>>>> discussion.
>>>>
>>>> The main point of this branch is to show all the places that we would need
>>>> to change to support the proposal.
>>>>
>>>> Things I’ve left for later:
>>>>
>>>> - currently only the first element in the _access array is used. Our
>>>> and/or syntax can be added later.
>>>> - building per-access views has not been implemented yet, couch_index
>>>> would have to be taught about the new per-access-id index.
>>>> - pretty HTTP error handling
>>>> - tests except for a tiny shell script 😇
>>>>
>>>> Implementation notes:
>>>>
>>>> You create a database with the _access feature turned on like so: PUT
>>>> /db?access=true
>>>>
>>>> I started out with storing _access in the document body, as that would
>>>> allow for a minimal change set, however, on doc updates, we try hard not
>>>> to load the old doc body from the database, and forcing us to do so for
>>>> EVERY doc update under _access seemed prohibitive, so I extended the #doc,
>>>> #doc_info and #full_doc_info records with a new `access` attribute that is
>>>> stored in both by-id and by-seq. I will need guidance on how extending
>>>> these records impacts multi-version cluster interop. And especially whether
>>>> this is an acceptable approach.
>>>>
>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>>>>
>>>> * * *
>>>>
>>>> The main addition is a new native query server called
>>>> couch_access_native_proc, which implements two new indexes by-access-id
>>>> and by-access-seq which do what you’d expect, pass in a userCtx and
>>>> retrieve the equivalent of _all_docs or _changes, but only including those
>>>> docs that match the username and roles in their _access property. The
>>>> existing handlers for _all_docs and _changes have been augmented to use
>>>> the new indexes instead of the default ones, unless the user is an admin.
>>>>
>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>>>>
>>>> * * *
>>>>
>>>> The rest of the diff is concerned with making document CRUD behave as
>>>> you’d expect it. See this little demonstration for what things look like:
>>>>
>>>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just
>>>> noticing that there might be something wonky with DELETE, but you’ll get
>>>> the gist #rimshot)
>>>>
>>>> * * *
>>>>
>>>> Open questions:
>>>>
>>>> - The aim of this is to get as close to regular CouchDB behaviour as
>>>> possible. One thing that is new however, which would require all apps to
>>>> be changed, is that docs in an _access enabled database need to include an
>>>> _access field (docs with no _access are admin-only for now). We
>>>> might want to consider on new document writes to auto-insert the
>>>> authenticated user’s name as the first element in the _access array, so
>>>> existing apps “just work”.
>>>>
>>>> - Interplay with partitioned dbs: eschewing db-per-user is already a large
>>>> boon if you have a lot of users, but making those per-user requests inside
>>>> an _access enabled database efficient would be doubly nice, so why not use
>>>> the username from the first question above and use that as the partition
>>>> key? This would work nicely for natural users with their own docs that
>>>> want to share them with others later, but I can easily imagine a pipelined
>>>> use of CouchDB, where a “collector” user creates all new docs, an
>>>> “analyser” takes them over and hands them to a “result” user for viewing.
>>>> In that case, we’d violate the high-cardinality rule of partitions (have a
>>>> lot of small ones), instead all docs go through all three users. I’d be
>>>> okay with treating the latter scenario as a minor use-case, but for that
>>>> use-case, we should be able to disable auto-partitioning on db creation.
>>>>
>>>> - building access view indexes for docs that have frequent _access
>>>> changes leads to many orphaned view indexes, we should look at an
>>>> auto-cleanup solution here (maybe keep 1-N indexes in case folks just swap
>>>> back and forth).
>>>>
>>>> * * *
>>>>
>>>> I’ll leave this here for now, I’m sure there are a few more things to
>>>> consider.
>>>>
>>>> I’d love to hear any and all feedback you might have. Especially if
>>>> anything is unclear.
>>>>
>>>> Best
>>>> Jan
>>>> —
>>>
>>> --
>>> Professional Support for Apache CouchDB:
>>> https://neighbourhood.ie/couchdb-support/
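
As a rough illustration of the by-access-seq idea from the implementation notes quoted above, here is a small in-memory sketch: each access id (a user name or role) gets its own section, and a user's changes feed is the merge of the sections for their name and roles. The class and method names are hypothetical, and unlike the prototype described above (which currently uses only the first _access element) this sketch indexes every entry in _access.

```python
from collections import defaultdict


class ByAccessSeq:
    """Toy per-access-id sequence index; not the Erlang implementation."""

    def __init__(self) -> None:
        self._seq = 0
        # access id -> {doc id: update sequence of the doc's latest revision}
        self._sections: dict[str, dict[str, int]] = defaultdict(dict)

    def update(self, doc_id: str, access: list[str]) -> int:
        """Record a doc write; stale entries for the doc are dropped first."""
        self._seq += 1
        for section in self._sections.values():
            section.pop(doc_id, None)
        for access_id in access:
            self._sections[access_id][doc_id] = self._seq
        return self._seq

    def changes(self, name: str, roles: list[str], since: int = 0) -> list[tuple[int, str]]:
        """Merge the sections for the user's name and roles, ordered by seq."""
        merged: dict[str, int] = {}
        for access_id in [name, *roles]:
            merged.update(self._sections.get(access_id, {}))
        return sorted((seq, doc_id) for doc_id, seq in merged.items() if seq > since)


index = ByAccessSeq()
index.update("a", ["user_a"])
index.update("b", ["user_b", "team_x"])
print(index.changes("user_a", ["team_x"]))
```

Note that once a doc's _access no longer lists a user, the doc simply disappears from that user's merged feed in this sketch, which is the revocation-visibility gap discussed earlier in the thread.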