My replies now inline.

> On 14. Mar 2019, at 16:13, Jan Lehnardt <j...@apache.org> wrote:
>
> I received some notes privately from Gregor Martynus, which I’m reproducing
> here in email thread form. This email is all Gregor’s notes, my next email is
> my replies to them.
>
>> On 10. Mar 2019, at 15:51, Jan Lehnardt <j...@apache.org> wrote:
>>
>> Hey all,
>>
>> after mulling this over some more, I’d like to tackle the detailed API and
>> behaviour for this. Especially how _access works in conjunction with existing
>> access control features.
>>
>> My guiding principles so far are:
>>
>> 1. Make the API intuitive: things should work the way they look like they
>> should work.
>> 2. The default should never be that a resource is accidentally left
>> accessible to the public.
>> 3. This should work as a natural extension to the existing security
>> features*.
>>
>> * I’d be up for reworking the whole lot, too, but that might be a better
>> discussion for > 4.0.
>>
>>
>> ## Database Creation and Default Behaviours
>>
>> Creating a database with _access features is, as mentioned before, done via a
>> flag to PUT /database?access=true
>>
>> In a 3.0 world where this would land, we already agreed that databases
>> should be admin-only by default (instead of world read/writeable today).
>> This is a sensible default, but that leaves us with an _access enabled
>> database that can’t be used by anyone but server or db admins. Not very
>> useful.
>>
>> To allow arbitrary users to use the db, I suggest we use the existing
>> _security system: i.e. if a user or a group a user belongs to is mentioned
>> in either `admins` or `members` inside of _security, they can proceed and
>> create documents on the db. This puts a second step burden on the
>> application developer, but it slots cleanly into the existing security
>> mechanisms, and doesn’t require special case handling.
>> Alternatively, we could define that _security isn’t available in _access
>> enabled databases, but that’s something I’d like to avoid if at all possible.
>>
>> In order to make it easy to specify that “everyone in _users” should be able
>> to use the db, I suggest we add a new role `_users` that is valid inside
>> _security, which means “everyone in /_users” (this only excludes server
>> admins, which have full access anyway).
>>
>> * * *
>>
>>
>> ## Document Creation and Access Control
>>
>> Next, one of our non-admin users creates a doc. There are multiple options
>> as to how we store the _access information.
>>
>> 1. Automatically translate the userCtx.name of a doc creation (not an
>> update) into the first element of the _access array. E.g. user_a PUT /db/doc
>> {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a little bit
>> counter-intuitive.
>>
>> 2. We require that a user puts "_access":["user_a"] in themselves. This is
>> an explicit granting of access permissions on doc creation and I think is
>> preferable.
>
> I prefer being explicit.
>
>
>>
>> This leaves the edge case of docs that have no _access member: so far I
>> thought those docs are admin-only, with maybe a db-wide option to swap the
>> default to public access, but I think given the explicitness of 2. we can do
>> better: require _access for all new doc creations in access-enabled
>> databases. A user cannot create a new document without an _access field
>> that is an array that has at least one member. For public documents, we
>> could invent a new role _public, and admin-only docs could use the existing
>> role _admin.
>>
>> The one downside to this approach is that we won’t be able to replicate
>> existing databases into an access-enabled database without modifying all
>> documents. This might be a worthwhile trade-off, but we should make that
>> decision consciously and document it well.
>
> We could also provide tooling for migrations?
I’d love tooling, but we’d have to make sure we can do it correctly for a
large number of use-cases. For the acceptance of this change, I’d make
“documenting a migration path for db-per-user setups” a MUST have, and any
code that helps with that a nice to have.

>
>
>> We could allow for a special case where an _admin user can create docs that
>> have no _access field, and those docs are treated as having only the _admin
>> role in _access. So at least we could replicate all data in, but then
>> require a manual step to update all docs to, say, migrate an existing
>> db-per-user app, while not accidentally exposing any docs to folks that
>> shouldn’t read them.
>>
>> For the rest of cRUD, the existing document must store one of the RUD-ing
>> user’s name or role in its _access field.
>>
>> For both creations and updates, a user MUST supply at least one role they
>> belong to or their own username.
>>
>> * * *
>>
>>
>> ## _revs_diff
>>
>> /db/_revs_diff can answer the question of which revisions of a document do
>> NOT exist on a replication target:
>> http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
>>
>> This would allow users to specify ids and rev(s) for docs they don’t have
>> access to (anymore), so the result schema should be expanded to handle id:
>> unauthorized or somesuch, something the replicator needs to know what to do
>> with, if it encounters it (say a user got removed from the _access list
>> in between the replicator opening _changes and requesting the doc).
>>
>> The _revs_diff implementation would have to be altered to send an
>> unauthorized token for each doc the requesting userCtx has no access to. If
>> we can re-use some of our existing indexes, or any other performance
>> optimisation, that’d be great. I haven’t looked at that code at all, yet.
>>
>> An important side-effect of this is, once a user has been added to a doc’s
>> _access list, they get access to “the full history of the doc”, even before
>> they had access.
>> Of course, in CouchDB this means only getting access to the rev ids, and
>> not the content, but since they are content-addressable hashes,
>> a user could brute-force themselves into revealing certain real values from
>> earlier incarnations of the doc. I’d rather not track _access per document
>> revision in perpetuity, so this is something we have to be very up-front
>> about.
>>
>> * * *
>>
>>
>> ## Partitioned Databases
>>
>> I mentioned partitioned databases in my previous mail, and I think it is
>> something we can document that end-users can opt into, but doesn’t require
>> any special casing on the _access proposal. That is, if users start
>> prefixing their doc ids with a user name or id and enable both _access and
>> partitions, then they get all the benefits of a partitioned database, and if
>> they choose not to, they don’t, but things keep working. There are enough
>> use-cases to warrant both behaviours.
>>
>> * * *
>>
>>
>> ## Scenarios that _access should help with.
>>
>> Overall, we developed _access to allow users to stop using the db-per-user
>> architecture, but once we have per-doc-access control, folks might start
>> using this for all manner of things. We should be clear about which
>> scenarios we support and which we don’t.
>>
>>
>> ### Scenario 1: db-per-user
>>
>> In this scenario, before _access enabled databases, the only way to allow
>> mutually untrusting users to store data in a part of CouchDB that only they
>> (and admins) have access to was giving each user their own database.
>>
>> In an _access enabled database, users can CRUD/_changes/_all_docs/_revs_diff
>> their own docs knowing no other user (aside from admins) can access those
>> docs.
>>
>> This is the simplest scenario, as all we’d have to do is track the owner of
>> a document and produce by-access-id/seq indexes based on that owner.
>>
>> The current prototype implementation mostly reflects this stage.
>> Not saying this is what we should ship, but it is the easiest to implement
>> and explain.
>>
>> Aside, I might be able to be persuaded to ship this as a 2.x feature, to
>> help those folks who don’t need anything else.
>>
>>
>> ### Scenario 2: db-per-user + Sharing
>
> One scenario we should address is how stopping sharing would work when
> documents are continuously replicated, e.g. to a client for offline usage. My
> understanding is that the person whose access to documents got revoked
> does not get a _changes update telling them that their access got removed; it
> would be up to the application developer to implement some kind of
> "notification" meta documents. Unless you have a better idea?

Since we now have a purge API as well, we could treat an un-share as a purge
for clients, and they can decide what to do with it. Alternatively, we need to
make breaking changes to the _changes feed; maybe we can hide that behind an
opt-in flag, like “/db/_changes?access=true”, and then we can send new rows
like: {seq: XYZ, id: abc, rev:4-YYY, _revoked: true} or somesuch.

>
>>
>> The second we allow per doc auth, users will want to share those docs with
>> other users. That’s why we initially suggested the _access field be an
>> array, so other users and groups can be specified to have access. There are
>> multiple scenarios in this one alone:
>>
>> #### 2.1: The Todo List
>>
>> In this scenario, a user has a reasonable amount of “personal data” that
>> they want to selectively share with one or more other users.
>>
>> #### 2.2: The Chat/Forum/Newsgroup
>>
>> In this scenario, a user wants to share any number of documents with a
>> reasonable number of groups. However, since we need to limit the number of
>> groups a user belongs to (currently 10, see below for details), this might
>> actually not be a great solution. Or folks couldn’t be in more than 10 chat
>> groups at a time.
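
A minimal sketch of how a client might consume the opt-in feed floated in the reply above, assuming rows shaped like the example given there. Everything specific here is hypothetical: the `_revoked` key and the `access=true` flag are only proposals in this thread, and `apply_changes` is a made-up helper name. The `doc` and `deleted` keys follow the usual `include_docs=true` row shape.

```python
from typing import Any


def apply_changes(
    local_docs: dict[str, dict[str, Any]],
    rows: list[dict[str, Any]],
) -> list[str]:
    """Apply changes rows to a local cache.

    Returns the ids that were dropped because access was revoked, so the
    application can decide what to tell the user.
    """
    dropped: list[str] = []
    for row in rows:
        doc_id = row["id"]
        if row.get("_revoked"):
            # Hypothetical row from the proposal: the user lost access,
            # so the client treats it like a purge of its local copy.
            if local_docs.pop(doc_id, None) is not None:
                dropped.append(doc_id)
            continue
        if row.get("deleted"):
            # Regular deletion: forget the doc, but do not report it.
            local_docs.pop(doc_id, None)
            continue
        doc = row.get("doc")
        if doc is not None:
            # Regular create/update carrying the doc body.
            local_docs[doc_id] = doc
    return dropped
```

Returning the revoked ids separately from ordinary deletions keeps the "they can decide what to do with it" part in the application's hands.
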
>>
>> #### 2.3: The Corporate Hierarchy
>>
>> In this scenario, users want to share any number of docs with a reasonable
>> number of groups in a top-down/bottom-up fashion. Think CEO shares with
>> executives, execs share with divisions, divisions report up to their one
>> executive, etc.
>>
>>
>> ### Scenario 3: Multiple Apps
>>
>> The preceding scenarios all assume that a single application is responsible
>> for everything. However, once we allow mutually distrusting users into a
>> single database *and* make each per-user slice work (almost) like a full
>> standalone CouchDB database, what would stop users from using this for a
>> multi-homing feature, where different applications are used for each user in
>> the same database?
>>
>> I’ll be referring to these scenarios down the line.
>>
>> * * *
>>
>>
>> ## Design Docs
>>
>> ### Admin
>>
>> One of the downsides of db-per-user is managing design docs in the face of a
>> changing application, that is, how to distribute new design docs across tens
>> of thousands of user dbs? It’s not impossible, but tedious. In all scenarios
>> above but scenario 3., we could simplify this significantly. Say an admin
>> creates a design doc, and gives all users in the db access to this design
>> doc (this could be with the _users role, or yet another new role _members,
>> if we need it). Requesting the result of a view defined in that design doc
>> will then produce an index that is powered by the requesting user’s
>> by-access-seq index section(s).
>>
>> N.B., this would require us to change a fundamental assumption when doing
>> the association between a design doc’s definition and index: normally, there
>> is only the `views` member that is hashed and that hash is used as the
>> index’s filename. Because there is only by-seq to power a view, that all
>> works. But now that we have an arbitrary set of sections on by-access-seq,
>> any view index built will have to take a user’s name and roles into account.
>> When a user leaves a group, or gains a group, all indexes for that user will
>> no longer be valid and need rebuilding.
>>
>>
>> ### User
>>
>> In any of the scenarios above, but especially 3., there could be legitimate
>> per-user design docs, so how should those be treated in an _access enabled
>> database?
>>
>> The significant fields in a design doc are `views`, `validate_doc_update`
>> and `filters` (I’ll skip over the deprecated _show, _list, and _update).
>>
>> The easiest to handle is `filters`: if a user specifies a filter for a
>> _changes request or replication that lives in a design doc they don’t have
>> access to, they get an error, similar to if they specify a non-existent
>> design doc, just with `unauthorized` instead of `not_found`.
>>
>> Next, `views` is also not very hard to imagine working: just like globally
>> defined views for that db, the index is built for each user based on the
>> user’s name and roles.
>>
>> More troubling are `validate_doc_update` functions: One, they are already
>> troubling in that they slow down any document updates. Two, if we now import
>> an existing db-per-user scenario where each user has their own design docs,
>
> I can’t think of a db-per-user scenario where each user DB would have a
> different validate_doc_update method? It would be the same method with access
> to the user context, the DB’s security settings and the document, so it would
> act differently for different users, but using the same code.

They wouldn’t be different, but if we were to replicate 1000 db-per-user
design docs into a single database, as per today’s semantics, we’d have to run
1000 VDUs on each doc update.

>
>> how should we apply validate_doc_update functions? 10s of 1000s of VDUs are
>> impractical to apply on each doc update, let alone just the management of
>> VDUs that are active on a database. One option would be to ignore VDUs if
>> they are not defined globally (say with a _members role). But especially in
>> scenario 3.
>> this becomes problematic, but even without that specific
>> scenario, this violates the no surprises best practice.
>>
>> We could say:
>>
>> a) we don’t support scenario 3.
>
> +1, I think it would make our lives easier in general if we don’t recommend
> sharing the same CouchDB for multiple apps. At least I don’t see a reason to
> do that at this point.

I think I like this best, too, but I’d like to hear from others as well.

Best
Jan
--

>
>> b) we find a complicated but efficient way to apply only those VDUs that are
>> defined in design docs the writing user has access to plus any global ones
>> (this would be neat but rather complicated and potentially still impractical
>> from a performance perspective for N users).
>> c) we could store all per-user design docs, but ignore them completely,
>> VDUs, views and filters.
>>
>> I think I currently fall on the side of not supporting scenario 3. and
>> asking folks who migrate db-per-user to de-duplicate design docs and keep
>> them per-app. I believe that is a good trade-off between the most common
>> scenarios for db-per-user while keeping the implementation manageable.
>> Globally accessible design docs would show up in a user’s changes feed and
>> would replicate down to, say, a PouchDB application which might be the
>> exclusive user of those design docs.
>>
>> In practice this would mean a document that has an _id that starts with
>> _design/ will have to be produced by a database admin. Luckily, that’s
>> already the case. We should just make sure that folks don’t give db-admin
>> access to all users habitually.
>>
>>
>> ## Read and Write Access
>>
>> Speaking of validate_doc_update, it is used for two things: checking
>> document schema and doc update authorisation.
>>
>> Once we allow access to a document with an _access field, we need to decide
>> what kind of access this gives to a doc: read-only or read-write (I’m not
>> considering write-only because for anything but doc creations this is not
>> useful as you need access to the current _rev).
>>
>> However, when we look at implementing an application on top of our existing
>> API, it is already weird that read access can be controlled globally (or
>> with _access on a per doc level), but write access requires writing
>> JavaScript code. I think it would be reasonable for users to
>> expect a per-doc read/write permission granting.
>
> Yes!
>
>>
>> So we could have all of the above, but with two extra fields: _access_read
>> and _access_write, or _access: {read: [], write: []}
>
> I prefer this API for its compactness, thinking about offline
> synchronization. The smaller the docs, the better.
>
> Best
> “Gregor”
> --
>
>
>> or we overload user and group names: _access: [user_a:read, user_b:write]
>> (or any permutation thereof). Overloading can cause trouble with naturally
>> occurring characters in group names.
>>
>> The former seems more explicit, but from an API perspective that’s a little
>> more awkward: remember that we currently have an arbitrary limit of 10
>> members in a user’s role array, to avoid excessive fan out on
>> cluster-internal operations. Partitioned dbs could get away with more, more
>> easily however. If we allow the specification of access control in two
>> lists, and one of the lists implies membership in the other, we have a total
>> limit of 10 members across both arrays. Or we limit 5 + 5, but that seems
>> excessive, while 10 total seems weird, but doable. Anyway, good bikeshed.
>>
>>
>> * * *
>>
>>
>> So far. I think all of the problems outlined are solvable, if with a clear
>> definition of what use-cases we do not support with _access.
>> If you have more scenarios than the ones I outlined, please add them and we
>> can see if they cause any additional trouble.
>>
>> Thanks for reading this far and I’m looking forward to your feedback.
>>
>>
>> Best,
>> Jan “_access” Lehnardt
>> --
>>
>>
>>> On 17. Feb 2019, at 15:25, Jan Lehnardt <j...@apache.org> wrote:
>>>
>>> Hi Everyone,
>>>
>>> I’m happy to share my work in progress attempt to implement the per-doc
>>> access control feature we discussed a good while ago:
>>>
>>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>>>
>>> You can check out my branch here:
>>>
>>> https://github.com/apache/couchdb/compare/access?expand=1
>>>
>>> It is very much work in progress, but it is far enough along to warrant
>>> discussion.
>>>
>>> The main point of this branch is to show all the places that we would need
>>> to change to support the proposal.
>>>
>>> Things I’ve left for later:
>>>
>>> - currently only the first element in the _access array is used. Our and/or
>>> syntax can be added later.
>>> - building per-access views has not been implemented yet, couch_index would
>>> have to be taught about the new per-access-id index.
>>> - pretty HTTP error handling
>>> - tests except for a tiny shell script 😇
>>>
>>> Implementation notes:
>>>
>>> You create a database with the _access feature turned on like so: PUT
>>> /db?access=true
>>>
>>> I started out with storing _access in the document body, as that would
>>> allow for a minimal change set. However, on doc updates, we try hard not to
>>> load the old doc body from the database, and forcing us to do so for EVERY
>>> doc update under _access seemed prohibitive, so I extended the #doc,
>>> #doc_info and #full_doc_info records with a new `access` attribute that is
>>> stored in both by-id and by-seq. I will need guidance on how extending
>>> these records impacts multi-version cluster interop. And especially whether
>>> this is an acceptable approach.
>>>
>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>>>
>>> * * *
>>>
>>> The main addition is a new native query server called
>>> couch_access_native_proc, which implements two new indexes by-access-id and
>>> by-access-seq which do what you’d expect: pass in a userCtx and retrieve
>>> the equivalent of _all_docs or _changes, but only including those docs that
>>> match the username and roles in their _access property. The existing
>>> handlers for _all_docs and _changes have been augmented to use the new
>>> indexes instead of the default ones, unless the user is an admin.
>>>
>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>>>
>>> * * *
>>>
>>> The rest of the diff is concerned with making document CRUD behave as you’d
>>> expect it.
>>> See this little demonstration for what things look like:
>>>
>>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just
>>> noticing that there might be something wonky with DELETE, but you’ll get
>>> the gist #rimshot)
>>>
>>> * * *
>>>
>>> Open questions:
>>>
>>> - The aim of this is to get as close to regular CouchDB behaviour as
>>> possible. One thing that is new, however, and would require all apps to be
>>> changed, is that in an _access enabled database they have to include an
>>> _access field in their docs (docs with no _access are admin-only for now).
>>> We might want to consider on new document writes to auto-insert the
>>> authenticated user’s name as the first element in the _access array, so
>>> existing apps “just work”.
>>>
>>> - Interplay with partitioned dbs: eschewing db-per-user is already a large
>>> boon if you have a lot of users, but making those per-user requests inside
>>> an _access enabled database efficient would be doubly nice, so why not use
>>> the username from the first question above and use that as the partition
>>> key? This would work nicely for natural users with their own docs that want
>>> to share them with others later, but I can easily imagine a pipelined use
>>> of CouchDB, where a “collector” user creates all new docs, an “analyser”
>>> takes them over and hands them to a “result” user for viewing. In that case,
>>> we’d violate the high-cardinality rule of partitions (have a lot of small
>>> ones), instead all docs go through all three users. I’d be okay with
>>> treating the latter scenario as a minor use-case, but for that use-case, we
>>> should be able to disable auto-partitioning on db creation.
>>>
>>> - building access view indexes for docs that have frequent _access changes
>>> leads to many orphaned view indexes; we should look at an auto-cleanup
>>> solution here (maybe keep 1-N indexes in case folks just swap back and
>>> forth).
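
The matching rule described in this mail (return only docs whose _access lists the requesting user's name or one of their roles, with docs lacking _access visible to admins only) can be sketched in a few lines. This is a minimal Python illustration, not the Erlang code in the branch. The names `principals`, `can_access` and `visible_docs` are made up for the example, `_access` is assumed to be a flat list, and matching any element is an assumption, since the prototype currently only looks at the first element.

```python
from typing import Any

ADMIN_ROLE = "_admin"  # the existing CouchDB admin role


def principals(user_ctx: dict[str, Any]) -> set[str]:
    """Collect the user's name and roles into one set."""
    names = set(user_ctx.get("roles") or [])
    if user_ctx.get("name"):
        names.add(user_ctx["name"])
    return names


def can_access(doc: dict[str, Any], user_ctx: dict[str, Any]) -> bool:
    """True if the user may see this doc under the assumed rules."""
    who = principals(user_ctx)
    if ADMIN_ROLE in who:
        # Admins bypass the per-doc check entirely.
        return True
    # Docs without _access (or with an empty list) are admin-only.
    access = doc.get("_access") or []
    # Assumption: any single matching entry grants access.
    return any(entry in who for entry in access)


def visible_docs(
    docs: list[dict[str, Any]], user_ctx: dict[str, Any]
) -> list[dict[str, Any]]:
    """Rough equivalent of a per-user _all_docs over an in-memory list."""
    return [doc for doc in docs if can_access(doc, user_ctx)]
```

A real implementation would answer this from the by-access-id and by-access-seq indexes rather than by scanning every doc; the scan here only shows the rule.
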
>>>
>>> * * *
>>>
>>> I’ll leave this here for now, I’m sure there are a few more things to
>>> consider.
>>>
>>> I’d love to hear any and all feedback you might have. Especially if
>>> anything is unclear.
>>>
>>> Best
>>> Jan
>>> --
>>
>> --
>> Professional Support for Apache CouchDB:
>> https://neighbourhood.ie/couchdb-support/
>>
>
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>

--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/