Re: [DISCUSS] Per-doc access control

Jan Lehnardt Thu, 14 Mar 2019 08:20:08 -0700

My replies now inline.

> On 14. Mar 2019, at 16:13, Jan Lehnardt <j...@apache.org> wrote:
> 
> I received some notes privately from Gregor Martynus, which I’m reproducing 
> here in email thread form. This email is all Gregor’s notes, my next email is 
> my replies to them.
> 
>> On 10. Mar 2019, at 15:51, Jan Lehnardt <j...@apache.org> wrote:
>> 
>> Hey all,
>> 
>> after mulling this over some more, I’d like to tackle the detailed API and 
>> behaviour for this. Especially how _access works in conjunction with existing 
>> access control features.
>> 
>> My guiding principles so far are:
>> 
>> 1. Make the API intuitive: things should work the way they look like they 
>> should work.
>> 2. The default should never be that a resource is accidentally left 
>> accessible to the public.
>> 3. This should work as a natural extension to the existing security 
>> features*.
>> 
>> * I’d be up for reworking the whole lot, too, but that might be a better 
>> discussion for > 4.0.
>> 
>> 
>> ## Database Creation and Default Behaviours
>> 
>> Creating a database with _access features is, as mentioned before, done via a 
>> flag to PUT /database?access=true
>> 
>> In a 3.0 world where this would land, we already agreed that databases 
>> should be admin-only by default (instead of world read/writeable today). 
>> This is a sensible default, but that leaves us with an _access enabled 
>> database that can’t be used by anyone but server or db admins. Not very 
>> useful.
>> 
>> To allow arbitrary users to use the db, I suggest we use the existing 
>> _security system: i.e. if a user or a group a user belongs to is mentioned 
>> in either `admins` or `members` inside of _security, they can proceed and 
>> create documents on the db. This puts a second step burden on the 
>> application developer, but it slots cleanly into the existing security 
>> mechanisms, and doesn’t require special case handling. Alternatively, we 
>> could define that _security isn’t available in _access enabled databases, 
>> but that’s something I’d like to avoid if at all possible.
>> 
>> In order to make it easy to specify that “everyone in _users” should be able 
>> to use the db, I suggest we add a new role `_users` that is valid inside 
>> _security, which means “everyone in /_users” (this only excludes server 
>> admins which have full access anyway).
>> 
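
To make the proposed `_users` role concrete, here is a minimal Python sketch of the membership check. The function name `can_use_db` and the dict shapes are made up for illustration (they mirror the `names`/`roles` layout of today's _security object), and `_users` is only the role proposed above, not something that exists yet.

```python
def can_use_db(user_ctx, security):
    """Return True if the user may create documents in the database.

    Hypothetical helper: `user_ctx` looks like {"name": ..., "roles": [...]},
    `security` looks like a _security object with `admins` and `members`.
    """
    if "_admin" in user_ctx.get("roles", []):
        return True  # server admins have full access anyway
    for section in ("admins", "members"):
        entry = security.get(section, {})
        if user_ctx.get("name") in entry.get("names", []):
            return True
        allowed_roles = set(entry.get("roles", []))
        # Proposed: `_users` means "everyone in /_users", i.e. any named user.
        if "_users" in allowed_roles and user_ctx.get("name"):
            return True
        if allowed_roles & set(user_ctx.get("roles", [])):
            return True
    return False
```

With an empty _security object this returns False for everyone but admins, which matches the admin-only default discussed above.
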
>> * * *
>> 
>> 
>> ## Document Creation and Access Control
>> 
>> Next, one of our non-admin users creates a doc. There are multiple options 
>> as to how we store the _access information.
>> 
>> 1. Automatically translate the userCtx.name of a doc creation (not an 
>> update) into the first element of the _access array. E.g. user_a PUT /db/doc 
>> {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a little bit 
>> counter-intuitive.
>> 
>> 2. We require that a user puts "_access":["user_a"] in themselves. This is 
>> an explicit granting of access permissions on doc creation and I think is 
>> preferable.
> 
> I prefer being explicit.
> 
> 
>> 
>> This leaves the edge case of docs that have no _access member: so far I 
>> thought those docs are admin-only, with maybe a db-wide option to swap the 
>> default to public access, but I think given the explicitness of 2. we can do 
>> better: require _access for all new doc creations in access-enabled 
>> databases. A user can not create a new document without an _access field 
>> that is an array that has at least one member. For public documents, we 
>> could invent a new role _public, and admin-only docs could use the existing 
>> role _admin.
>> 
>> The one downside to this approach is that we won’t be able to replicate 
>> existing databases into an access-enabled database without modifying all 
>> documents. This might be a worthwhile trade-off, but we should make that 
>> decision consciously and document it well.
> 
> We could also provide tooling for migrations?


I’d love tooling, but we’d have to make sure we can do it correctly for a big 
number of use-cases. For the acceptance of this change, I’d make “documenting a 
migration path for db-per-user setups” a MUST have, and any code that helps 
with that a nice to have.
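
As a rough idea of what such a helper could look like for the simplest db-per-user case, here is a small Python sketch. `migrate_doc` and its `owner` argument are hypothetical names; it assumes every doc in a user's database belongs to that one user, which will not hold for every setup.

```python
def migrate_doc(doc, owner):
    """Return a copy of `doc` with an _access list naming `owner`.

    Hypothetical helper for a db-per-user migration where `owner` is the
    user the source database belonged to. An existing _access is kept.
    """
    if doc.get("_id", "").startswith("_design/"):
        return None  # design docs are left for a manual, per-app decision
    migrated = dict(doc)  # shallow copy, the input doc is not modified
    migrated.setdefault("_access", [owner])
    return migrated
```

For example, `migrate_doc({"_id": "todo-1", "a": 1}, "user_a")` yields the same doc with `"_access": ["user_a"]` added.
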

> 
> 
>> We could allow for a special case where an _admin user can create docs that 
>> have no _access field, and those docs are treated as having only the _admin 
>> role in _access. So at least we could replicate all data in, but then 
>> require a manual step to update all docs to say, migrate an existing 
>> db-per-user app, while not accidentally exposing any docs to folks that 
>> shouldn’t read them.
>> 
>> For the rest of cRUD, the existing document must store one of the RUD-ing 
>> user’s name or role in its _access field.
>> 
>> For both creations and updates, a user MUST supply at least one role they 
>> belong to or their own username.
>> 
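
The write rules above can be sketched as a single check. This is a Python illustration only: `check_access_on_write` and the error strings are invented, _access entries are assumed to be plain strings, and the proposed `_public` role is not handled.

```python
def check_access_on_write(user_ctx, new_doc, existing_access=None):
    """Return None if the write is allowed, else an error string.

    Hypothetical helper; `existing_access` is the stored _access list
    for updates/deletes and None for creations.
    """
    roles = user_ctx.get("roles", [])
    identities = {user_ctx.get("name"), *roles} - {None}
    is_admin = "_admin" in roles
    access = new_doc.get("_access")
    if access is None and is_admin:
        access = ["_admin"]  # proposed special case for admin writes
    if not isinstance(access, list) or not access:
        return "forbidden: _access must be a non-empty array"
    if is_admin:
        return None
    if existing_access is not None and not identities & set(existing_access):
        return "unauthorized: user is not in the existing doc's _access"
    if not identities & set(access):
        return "forbidden: _access must name the user or one of their roles"
    return None
```
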
>> * * *
>> 
>> 
>> ## _revs_diff
>> 
>> /db/_revs_diff can answer the question of which revisions of a document do 
>> NOT exist on a replication target: 
>> http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
>> 
>> This would allow users to specify ids and rev(s) for docs they don’t have 
>> access to (anymore), so the result schema should be expanded to handle id: 
>> unauthorized or somesuch, something the replicator needs to know what to do 
>> with, if it encounters it (say a user got removed from the _access list 
>> in between the replicator opening _changes and requesting the doc).
>> 
>> The _revs_diff implementation would have to be altered to send an unauthorized 
>> token for each doc the requesting userCtx has no access to. If we can re-use 
>> some of our existing indexes, or any other performance optimisation, that’d 
>> be great. I haven’t looked at that code at all, yet.
>> 
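
Here is a small Python model of what an access-aware _revs_diff answer might look like. The `{"error": "unauthorized"}` entry is a made-up shape for the "unauthorized token" above, and the in-memory `stored_revs`/`doc_access` dicts stand in for whatever index the real implementation would use.

```python
def revs_diff(request, stored_revs, doc_access, user_ctx):
    """Answer a _revs_diff-style request with an extra unauthorized marker.

    Hypothetical in-memory model: `stored_revs` maps doc id -> set of
    known revs, `doc_access` maps doc id -> _access list.
    """
    roles = user_ctx.get("roles", [])
    identities = {user_ctx.get("name"), *roles} - {None}
    is_admin = "_admin" in roles
    result = {}
    for doc_id, revs in request.items():
        known = stored_revs.get(doc_id)
        if known is not None and not is_admin:
            # Docs without _access are treated as admin-only here.
            allowed = set(doc_access.get(doc_id, ["_admin"]))
            if not identities & allowed:
                result[doc_id] = {"error": "unauthorized"}  # made-up shape
                continue
        missing = [r for r in revs if r not in (known or set())]
        if missing:
            result[doc_id] = {"missing": missing}
    return result
```

A replicator would then have to decide what to do with the unauthorized entries, for example skip those docs.
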
>> An important side-effect of this is, once a user has been added to a doc’s 
>> _access list, they get access to “the full history of the doc”, even before 
>> they had access. Of course, in CouchDB this means only getting access to the 
>> rev ids, and not the content, but since they are content-addressable hashes, 
>> a user could brute-force themselves into revealing certain real values from 
>> earlier incarnations of the doc. I’d rather not track _access per document 
>> revision in perpetuity, so this is something we have to be very up-front 
>> about.
>> 
>> * * *
>> 
>> 
>> ## Partitioned Databases
>> 
>> I mentioned partitioned databases in my previous mail, and I think it is 
>> something we can document that end-users can opt into, but doesn’t require 
>> any special casing on the _access proposal. That is, if users start 
>> prefixing their doc ids with a user name or id and enable both _access and 
>> partitions, then they get all the benefits of a partitioned database, and if 
>> they choose not to, they don’t, but things keep working. There are enough 
>> use-cases to warrant both behaviours.
>> 
>> * * *
>> 
>> 
>> ## Scenarios that _access should help with.
>> 
>> Overall, we developed _access to allow users to stop using the db-per-user 
>> architecture, but once we have per-doc-access control, folks might start 
>> using this for all manner of things. We should be clear about which 
>> scenarios we support and which we don’t.
>> 
>> 
>> ### Scenario 1: db-per-user
>> 
>> In this scenario, before _access enabled databases, the only way to allow 
>> mutually untrusting users to store data in a part of CouchDB that only they 
>> (and admins) have access to was giving each user their own database.
>> 
>> In an _access enabled database, users can CRUD/_changes/_all_docs/_revs_diff 
>> their own docs knowing no other user (aside from admins) can access those 
>> docs.
>> 
>> This is the simplest scenario, as all we’d have to do is track the owner of a 
>> document and produce by-access-id/seq indexes based on that owner.
>> 
>> The current prototype implementation mostly reflects this stage. Not saying 
>> this is what we should ship, but it is the easiest to implement and explain.
>> 
>> Aside, I might be able to be persuaded to ship this as a 2.x feature, to 
>> help those folks who don’t need anything else.
>> 
>> 
>> ### Scenario 2: db-per-user + Sharing
> 
> One scenario we should address is how un-sharing would work when 
> documents are continuously replicated, e.g. to a client for offline usage. My 
> understanding is that the person whose access to documents got revoked 
> does not get a _changes update telling them that their access got removed, so 
> it would be up to the application developer to implement some kind of 
> "notification" meta documents. Unless you have a better idea?

Since we now have a purge API as well, we could treat an un-share as a purge 
for clients, and they can decide what to do with it.

Alternatively, we need to make breaking changes to the _changes feed. Maybe we 
can hide that behind an opt-in flag, like “/db/_changes?access=true”, and then 
we can send new rows like:

{seq: XYZ, id: abc, rev:4-YYY, _revoked: true} or somesuch.
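
On the client side, handling such a row could be as simple as the following Python sketch. `apply_change` and the dict-based local store are invented for illustration, and `_revoked` is only the flag floated above, not an existing _changes field.

```python
def apply_change(local_docs, row):
    """Apply one _changes row to a client's local doc store (a dict).

    Hypothetical: `_revoked` is the proposed opt-in flag; rows without it
    are left for the normal fetch-and-store path.
    """
    if row.get("_revoked"):
        # Treat an un-share like a purge: drop the local copy entirely.
        local_docs.pop(row["id"], None)
        return "removed"
    return "fetch"  # caller would fetch row["id"] at row["rev"]
```

Whether a client really drops the doc, or keeps a read-only copy, would be up to the application.
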


> 
>> 
>> The second we allow per doc auth, users will want to share those docs with 
>> other users. That’s why we initially suggested the _access field be an 
>> array, so other users and groups can be specified to have access. There are 
>> multiple scenarios in this one alone:
>> 
>> #### 2.1: The Todo List
>> 
>> In this scenario, a user has a reasonable amount of ”personal data” that 
>> they want to selectively share with one or more other users.
>> 
>> #### 2.2: The Chat/Forum/Newsgroup
>> 
>> In this scenario, a user wants to share any number of documents with a 
>> reasonable number of groups. However, since we need to limit the number of 
>> groups a user belongs to (currently 10, see below for details), this might 
>> actually not be a great solution. Or folks couldn’t be in more than 10 chat 
>> groups at a time.
>> 
>> #### 2.3: The Corporate Hierarchy
>> 
>> In this scenario, users want to share any number of docs with a reasonable 
>> number of groups in a top-down/bottom-up fashion. Think CEO shares with 
>> executives, execs share with divisions, divisions report up to their one 
>> executive, etc.
>> 
>> 
>> ### 3: Multiple Apps
>> 
>> The preceding scenarios all assume that a single application is responsible 
>> for everything. However, once we allow mutually distrusting users into a 
>> single database *and* make each per-user slice work (almost) like a full 
>> standalone CouchDB database, what would stop users from using this for a 
>> multi-homing feature, where different applications are used for each user in 
>> the same database?
>> 
>> I’ll be referring to these scenarios down the line.
>> 
>> * * *
>> 
>> 
>> ## Design Docs
>> 
>> ### Admin
>> 
>> One of the downsides of db-per-user is managing design docs in the face of a 
>> changing application, that is, how to distribute new design docs across 10s 
>> of 1000+s of user dbs? It’s not impossible, but tedious. In all scenarios 
>> above but scenario 3., we could simplify this significantly. Say an admin 
>> creates a design doc, and gives all users in the db access to this design 
>> doc (this could be with the _users role, or yet another new role _members, 
>> if we need it), requesting the result of a view defined in that design doc 
>> will produce an index that is powered by the requesting user’s by-access-seq 
>> index section(s).
>> 
>> N.B., this would require us to change a fundamental assumption when doing 
>> the association between a design doc’s definition and index: normally, there 
>> is only the `views` member that is hashed and that hash is used as the 
>> index’s filename. Because there is only by-seq to power a view, that all 
>> works. But now that we have an arbitrary set of sections on by-access-seq, 
>> any view index built will have to take a user’s name and roles into account. 
>> When a user leaves a group, or gains a group, all indexes for that user will 
>> no longer be valid and need rebuilding.
>> 
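
To illustrate the point about index identity: a per-user index key would have to cover both the view definitions and the user's access ids. The Python sketch below is not how CouchDB computes view signatures today; `index_signature` and the JSON/SHA-256 encoding are assumptions chosen only to show the dependency.

```python
import hashlib
import json


def index_signature(views, user_ctx):
    """Hash the view definitions together with the user's access ids.

    Hypothetical: only illustrates that name and roles become part of
    the key that identifies an index.
    """
    access_ids = sorted({user_ctx["name"], *user_ctx.get("roles", [])})
    payload = json.dumps({"views": views, "access": access_ids}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the roles are part of the hash, a user who gains or loses a group gets a different signature, which is the "indexes need rebuilding" consequence described above.
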
>> 
>> ### User
>> 
>> In any of the scenarios above, but especially 3., there could be legitimate 
>> per-user design docs, so how should those be treated in an _access enabled 
>> database?
>> 
>> The significant fields in a design doc are `views`, `validate_doc_update` 
>> and `filters` (I’ll skip over the deprecated _show, _list, and _update).
>> 
>> The easiest to handle is `filters`: if a user specifies a filter for a 
>> _changes request or replication that lives in a design doc they don’t have 
>> access to, they get an error, similar to if they specify a non-existent 
>> design doc, just with `unauthorized` instead of `not_found`.
>> 
>> Next `views` is also not very hard to imagine working: just like globally 
>> defined views for that db, the index is built for each user based on the 
>> user’s name and roles.
>> 
>> More troubling are `validate_doc_update` functions: One, they are already 
>> troubling in that they slow down any document updates. Two, if we now import 
>> an existing db-per-user scenario where each user has their own design docs,
> 
> I can’t think of a db-per-user scenario where each user DB would have a 
> different validate_doc_update method? It would be the same method with access 
> to the user context, the DBs security setting and the document, so it would 
> act differently for different users, but using the same code.

They wouldn’t be different, but if we were to replicate 1000 db-per-user design 
docs into a single database, as per today’s semantics, we’d have to run 1000 
VDUs on each doc update.

> 
>> how should we apply validate_doc_update functions? 10s of 1000s of VDUs are 
>> impractical to apply on each doc update, let alone just the management of 
>> VDUs that are active on a database. One option would be to ignore VDUs if 
>> they are not defined globally (say with a _members role). But especially in 
>> scenario 3. this becomes problematic, but even without that specific 
>> scenario, this violates the no surprises best practice.
>> 
>> We could say:
>> 
>> a) we don’t support scenario 3.
> 
> +1, I think it would make our lives easier in general if we don’t recommend 
> to share the same CouchDB for multiple apps. At least I don’t see a reason to 
> do that at this point.

I think I like this best, too, but I’d like to hear from others as well.


Best
Jan
—
> 
>> b) we find a complicated but efficient way to apply only those VDUs that are 
>> defined in design docs the writing user has access to plus any global ones 
>> (this would be neat but rather complicated and potentially still impractical 
>> from a performance perspective for N users).
>> c) we could store all per-user design docs, but ignore them completely, 
>> VDUs, views and filters.
>> 
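
For option b), the selection logic itself is easy to state, as in this Python sketch. `applicable_vdus` and the `global_roles` default are hypothetical, and the linear scan over all design docs shows exactly the part that would need a smarter index to be practical for N users.

```python
def applicable_vdus(design_docs, user_ctx, global_roles=("_users", "_members")):
    """Return ids of design docs whose VDU would run for this writer.

    Hypothetical: a design doc counts as global if its _access names one
    of `global_roles`, otherwise the writer must be listed in its _access.
    """
    identities = {user_ctx.get("name"), *user_ctx.get("roles", [])} - {None}
    selected = []
    for ddoc in design_docs:
        if "validate_doc_update" not in ddoc:
            continue  # nothing to run for this design doc
        access = set(ddoc.get("_access", []))
        if access & set(global_roles) or access & identities:
            selected.append(ddoc["_id"])
    return selected
```
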
>> I think I currently fall on the side of not supporting scenario 3. and 
>> asking folks who migrate db-per-user to de-duplicate design docs and keep 
>> them per-app. I believe that is a good trade-off between the most common 
>> scenarios for db-per-user while keeping the implementation manageable. 
>> Globally accessible design docs would show up in a user’s changes feed and 
>> would replicate down to say a PouchDB application which might be the 
>> exclusive user of those design docs.
>> 
>> In practice this would mean, a document that has an _id that starts with 
>> _design/ will have to be produced by a database admin. Luckily, that’s 
>> already the case. We should just make sure that folks don’t give db-admin 
>> access to all users habitually.
>> 
>> 
>> ## Read and Write Access
>> 
>> Speaking of validate_doc_update, it is used for two things: checking 
>> document schema and doc update authorisation.
>> 
>> Once we allow access to a document with an _access field, we need to decide 
>> what kind of access this gives to a doc: read-only or read-write (I’m not 
>> considering write-only because for anything but doc creations this is not 
>> useful as you need access to the current _rev).
>> 
>> However, when we look at implementing an application on top of our existing 
>> API, it is already weird that read access can be controlled globally (or 
>> with _access on a per doc level), but write access requires writing 
>> JavaScript code. I think it would be a reasonable expectation for users to 
>> expect a per-doc read/write permission granting.
> 
> Yes!
> 
>> 
>> So we could have all of the above, but with two extra fields: _access_read 
>> and _access_write, or _access: {read: [], write: []}
> 
> I prefer this API for its compactness, thinking about offline 
> synchronization. The smaller the docs, the better.
> 
> Best
> “Gregor”
> —
> 
> 
>> or we overload user and group names: _access: [user_a:read, user_b:write] 
>> (or any permutation thereof). Overloading can cause trouble with naturally 
>> occurring characters in group names.
>> 
>> The former seems more explicit, but from an API perspective that’s a little 
>> more awkward: remember that we currently have an arbitrary limit of 10 
>> members in a user’s role array, to avoid excessive fan out on 
>> cluster-internal operations. Partitioned dbs could get away with more, more 
>> easily however. If we allow the specification of access control in two 
>> lists, and one of the lists implies membership in the other, we have a total 
>> limit of 10 members across both arrays. Or we limit 5 + 5, but that seems 
>> excessive, while 10 total seems weird, but doable. Anyway, good bikeshed.
>> 
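
The "10 total across both lists" variant could be checked as in the Python sketch below. The name `normalise_access` and the rule that write implies read are assumptions for illustration; the limit of 10 is the arbitrary number mentioned above.

```python
MAX_ACCESS_MEMBERS = 10  # the arbitrary limit mentioned above


def normalise_access(access):
    """Validate {"read": [...], "write": [...]} and return it normalised.

    Hypothetical rules: write implies read, and the limit applies to the
    distinct members across both lists.
    """
    write = list(dict.fromkeys(access.get("write", [])))  # de-duplicate, keep order
    read_only = [m for m in dict.fromkeys(access.get("read", [])) if m not in write]
    if not write and not read_only:
        raise ValueError("_access needs at least one member")
    if len(write) + len(read_only) > MAX_ACCESS_MEMBERS:
        raise ValueError("too many _access members")
    return {"read": read_only + write, "write": write}
```

Counting distinct members means a user listed in both arrays only uses up one of the 10 slots.
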
>> 
>> * * * 
>> 
>> 
>> So far, I think all of the problems outlined are solvable, given a clear 
>> definition of what use-cases we do not support with _access. If you have more 
>> scenarios than the ones I outlined, please add them and we can see if they 
>> cause any additional trouble.
>> 
>> Thanks for reading this far and I’m looking forward to your feedback.
>> 
>> 
>> Best,
>> Jan “_access” Lehnardt
>> —
>> 
>> 
>> 
>> 
>>> On 17. Feb 2019, at 15:25, Jan Lehnardt <j...@apache.org> wrote:
>>> 
>>> Hi Everyone,
>>> 
>>> I’m happy to share my work in progress attempt to implement the per-doc 
>>> access control feature we discussed a good while ago:
>>> 
>>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>>> 
>>> You can check out my branch here:
>>> 
>>> https://github.com/apache/couchdb/compare/access?expand=1
>>> 
>>> It is very much work in progress, but it is far enough along to warrant 
>>> discussion.
>>> 
>>> The main point of this branch is to show all the places that we would need 
>>> to change to support the proposal.
>>> 
>>> Things I’ve left for later:
>>> 
>>> - currently only the first element in the _access array is used. Our and/or 
>>> syntax can be added later.
>>> - building per-access views has not been implemented yet, couch_index would 
>>> have to be taught about the new per-access-id index.
>>> - pretty HTTP error handling
>>> - tests except for a tiny shell script 😇
>>> 
>>> Implementation notes:
>>> 
>>> You create a database with the _access feature turned on like so:  PUT 
>>> /db?access=true
>>> 
>>> I started out with storing _access in the document body, as that would 
>>> allow for a minimal change set, however, on doc updates, we try hard not to 
>>> load the old doc body from the database, and forcing us to do so for EVERY 
>>> doc update under _access seemed prohibitive, so I extended the #doc, 
>>> #doc_info and #full_doc_info records with a new `access` attribute that is 
>>> stored in both by-id and by-seq. I will need guidance on how extending 
>>> these records impact multi-version cluster interop. And especially whether 
>>> this is an acceptable approach.
>>> 
>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>>> 
>>> * * *
>>> 
>>> The main addition is a new native query server called 
>>> couch_access_native_proc, which implements two new indexes by-access-id and 
>>> by-access-seq which do what you’d expect, pass in a userCtx and retrieve 
>>> the equivalent of _all_docs or _changes, but only including those docs that 
>>> match the username and roles in their _access property. The existing 
>>> handlers for _all_docs and _changes have been augmented to use the new 
>>> indexes instead of the default ones, unless the user is an admin.
>>> 
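
In plain terms, the by-access-id lookup behaves like the filter in this Python sketch. It is an in-memory stand-in, not the prototype's code: `all_docs_for` is an invented name, and unlike the prototype (which only looks at the first _access element for now) it matches against every entry.

```python
def all_docs_for(user_ctx, docs):
    """Return ids of docs visible to the user, sorted by id.

    Hypothetical in-memory stand-in for the by-access-id index; docs
    without _access are admin-only, as in the prototype.
    """
    roles = user_ctx.get("roles", [])
    if "_admin" in roles:
        return sorted(d["_id"] for d in docs)  # admins use the default index
    identities = {user_ctx.get("name"), *roles} - {None}
    return sorted(d["_id"] for d in docs if identities & set(d.get("_access", [])))
```
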
>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>>> 
>>> * * *
>>> 
>>> The rest of the diff is concerned with making document CRUD behave as you’d 
>>> expect it. See this little demonstration for what things look like:
>>> 
>>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just 
>>> noticing that there might be something wonky with DELETE, but you’ll get 
>>> the gist #rimshot)
>>> 
>>> * * *
>>> 
>>> Open questions:
>>> 
>>> - The aim of this is to get as close to regular CouchDB behaviour as 
>>> possible. One thing that is new, however, and which would require all apps 
>>> to be changed, is that docs in an _access enabled database have to include 
>>> an _access field (docs with no _access are admin-only for now). We might 
>>> want to consider auto-inserting the authenticated user’s name as the first 
>>> element in the _access array on new document writes, so existing apps “just 
>>> work”.
>>> 
>>> - Interplay with partitioned dbs: eschewing db-per-user is already a large 
>>> boon if you have a lot of users, but making those per-user requests inside 
>>> an _access enabled database efficient would be doubly nice, so why not use 
>>> the username from the first question above and use that as the partition 
>>> key? This would work nicely for natural users with their own docs that want 
>>> to share them with others later, but I can easily imagine a pipelined use 
>>> of CouchDB, where a “collector” user creates all new docs, an “analyser” 
>>> takes them over and hands them to a “result” user for viewing. In that case, 
>>> we’d violate the high-cardinality rule of partitions (have a lot of small 
>>> ones), instead all docs go through all three users. I’d be okay with 
>>> treating the latter scenario as a minor use-case, but for that use-case, we 
>>> should be able to disable auto-partitioning on db creation.
>>> 
>>> - building access view indexes for docs that have frequent _access changes 
>>> leads to many orphaned view indexes, we should look at an auto-cleanup 
>>> solution here (maybe keep 1-N indexes in case folks just swap back and 
>>> forth).
>>> 
>>> * * *
>>> 
>>> I’ll leave this here for now, I’m sure there are a few more things to 
>>> consider.
>>> 
>>> I’d love to hear any and all feedback you might have. Especially if 
>>> anything is unclear.
>>> 
>>> Best
>>> Jan
>>> —
>> 
>> -- 
>> Professional Support for Apache CouchDB:
>> https://neighbourhood.ie/couchdb-support/
>> 
> 
> -- 
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
