Re: [DISCUSS] Per-doc access control

Adam Kocoloski Wed, 03 Apr 2019 14:08:10 -0700

I’m also in favor of dropping Scenario 3.

One topic we may have discussed in the past but I wanted to close out here: in 
the relational database world it’s not uncommon to use materialized views as an 
access control mechanism to selectively expose contents of a table to clients 
who cannot access the table directly. Does the current thinking on _access for 
views support that use case? Can we build a view using a set of roles inherited 
from the user who created the design doc, but then turn around and set the 
_access on the view itself to a less-restrictive set?


On the _revs_diff topic — I’m not all that concerned about users trying to 
guess revision IDs that exist on the server, and then reverse-engineer the 
contents of the existing revisions. Maybe I ought to be.

On a somewhat-related note, I have had conversations before with folks who are 
keen to adopt these sorts of fine-grained access control systems who said they 
actually prefer to have a 403 Forbidden response list the set of privileges 
that would be sufficient to access the resource. I found this surprising, but I 
guess it comes down to a user needing to figure out what kind of security 
exception to apply for in order to make progress with some data analysis. I 
think this is a topic on which we could make a fairly late-binding decision — 
or even have it as a configurable option.
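
To illustrate the configurable option, here is a minimal Python sketch. The function name, the `list_privileges` switch and the `sufficient_privileges` field are all hypothetical; nothing like them exists in CouchDB today.

```python
import json


def forbidden_response(doc_access, list_privileges=False):
    """Build a hypothetical 403 body for a doc the requester cannot read.

    doc_access is the doc's _access array. Any single entry in it would be
    sufficient to read the doc, so that is what gets listed when enabled.
    """
    body = {"error": "forbidden", "reason": "You may not access this document."}
    if list_privileges:
        # Assumption: exposing the names/roles in _access is acceptable
        # for this deployment. It does reveal who the doc is shared with.
        body["sufficient_privileges"] = sorted(doc_access)
    return 403, json.dumps(body)


status, body = forbidden_response(["team_analytics", "user_a"], list_privileges=True)
print(status, body)
```

With the switch off, the body carries only the error and reason, which is the safer default if the contents of _access are themselves considered sensitive.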

I could definitely see the base Scenario 1 (single _access labels) landing 
ahead of the more-complex sharing models.

I haven’t had a chance to take a deep look at the code but the design seems 
good and thoughtful, and I definitely like the focus on the use cases.

Adam

> On Mar 14, 2019, at 11:21 AM, Jan Lehnardt wrote:
> 
> My replies now inline.
> 
>> On 14. Mar 2019, at 16:13, Jan Lehnardt wrote:
>> 
>> I received some notes privately from Gregor Martynus, which I’m reproducing 
>> here in email thread form. This email is all Gregor’s notes, my next email 
>> is my replies to them.
>> 
>>> On 10. Mar 2019, at 15:51, Jan Lehnardt wrote:
>>> 
>>> Hey all,
>>> 
>>> after mulling this over some more, I’d like to tackle the detailed API and 
>>> behaviour for this. Especially how _access works in conjunction with 
>>> existing access control features.
>>> 
>>> My guiding principles so far are:
>>> 
>>> 1. Make the API intuitive: things should work the way they look like they 
>>> should work.
>>> 2. The default should never be that a resource is accidentally left 
>>> accessible to the public.
>>> 3. This should work as a natural extension to the existing security 
>>> features*.
>>> 
>>> * I’d be up for reworking the whole lot, too, but that might be a better 
>>> discussion for > 4.0.
>>> 
>>> 
>>> ## Database Creation and Default Behaviours
>>> 
>>> Creating a database with _access features is, as mentioned before, done via 
>>> a flag: PUT /database?access=true
>>> 
>>> In a 3.0 world where this would land, we already agreed that databases 
>>> should be admin-only by default (instead of world read/writeable today). 
>>> This is a sensible default, but that leaves us with an _access enabled 
>>> database that can’t be used by anyone but server or db admins. Not very 
>>> useful.
>>> 
>>> To allow arbitrary users to use the db, I suggest we use the existing 
>>> _security system: i.e. if a user or a group a user belongs to is mentioned 
>>> in either `admins` or `members` inside of _security, they can proceed and 
>>> create documents on the db. This puts a second step burden on the 
>>> application developer, but it slots cleanly into the existing security 
>>> mechanisms, and doesn’t require special case handling. Alternatively, we 
>>> could define that _security isn’t available in _access enabled databases, 
>>> but that’s something I’d like to avoid if at all possible.
>>> 
>>> In order to make it easy to specify that “everyone in _users” should be 
>>> able to use the db, I suggest we add a new role `_users` that is valid 
>>> inside _security, which means “everyone in /_users” (this only excludes 
>>> server admins which have full access anyway).
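
A minimal Python sketch of the check described above. It assumes the existing _security shape (`admins` and `members`, each with `names` and `roles`) and a userCtx with `name` and `roles`. The `_users` role is the proposed addition, and `may_use_db` is a hypothetical helper name.

```python
def may_use_db(user_ctx, security):
    """Return True if the user may create docs in an _access enabled db."""
    if "_admin" in user_ctx["roles"]:
        return True  # server admins have full access anyway
    for section in ("admins", "members"):
        entry = security.get(section, {})
        if user_ctx["name"] in entry.get("names", []):
            return True
        roles = entry.get("roles", [])
        # Proposed `_users` role: matches every authenticated user in /_users.
        if "_users" in roles and user_ctx["name"] is not None:
            return True
        if set(roles) & set(user_ctx["roles"]):
            return True
    return False


security = {"admins": {"names": [], "roles": []},
            "members": {"names": [], "roles": ["_users"]}}
print(may_use_db({"name": "user_a", "roles": []}, security))
print(may_use_db({"name": None, "roles": []}, security))
```

Here an anonymous request (name is None) is rejected while any named user is let in, which is the "everyone in /_users" reading of the new role.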
>>> 
>>> * * *
>>> 
>>> 
>>> ## Document Creation and Access Control
>>> 
>>> Next, one of our non-admin users creates a doc. There are multiple options 
>>> as to how we store the _access information.
>>> 
>>> 1. Automatically translate the userCtx.name of a doc creation (not an 
>>> update) into the first element of the _access array. E.g. user_a PUT 
>>> /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a 
>>> little bit counter-intuitive.
>>> 
>>> 2. We require that a user puts "_access":["user_a"] in themselves. This is 
>>> an explicit granting of access permissions on doc creation and I think is 
>>> preferable.
>> 
>> I prefer being explicit.
>> 
>> 
>>> 
>>> This leaves the edge case of docs that have no _access member: so far I 
>>> thought those docs are admin-only, with maybe a db-wide option to swap the 
>>> default to public access, but I think given the explicitness of 2. we can 
>>> do better: require _access for all new doc creations in access-enabled 
>>> databases. A user can not create a new document without an _access field 
>>> that is an array that has at least one member. For public documents, we 
>>> could invent a new role _public, and admin-only docs could use the existing 
>>> role _admin.
>>> 
>>> The one downside to this approach is that we won’t be able to replicate 
>>> existing databases into an access-enabled database without modifying all 
>>> documents. This might be a worthwhile trade-off, but we should make that 
>>> decision consciously and document it well.
>> 
>> We could also provide tooling for migrations?
> 
> I’d love tooling, but we’d have to make sure we can do it correctly for a big 
> number of use-cases. For the acceptance of this change, I’d make “documenting 
> a migration path for db-per-user setups” a MUST have, and any code that helps 
> with that a nice to have.
> 
>> 
>> 
>>> We could allow for a special case where an _admin user can create docs that 
>>> have no _access field, and those docs are treated as having only the _admin 
>>> role in _access. So at least we could replicate all data in, but then 
>>> require a manual step to update all docs to say, migrate an existing 
>>> db-per-user app, while not accidentally exposing any docs to folks that 
>>> shouldn’t read them.
>>> 
>>> For the rest of cRUD, the existing document must store one of the RUD-ing 
>>> user’s name or role in its _access field.
>>> 
>>> For both creations and updates, a user MUST supply at least one role they 
>>> belong to or their own username.
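
The write rules in this section can be sketched roughly as follows. This is an illustration of the proposal in Python, not the prototype's code; `check_write` is a hypothetical name, and the admin special case is the optional one floated above.

```python
def check_write(user_ctx, new_doc, old_doc=None):
    """Return the _access list to store, or raise PermissionError."""
    identities = {user_ctx["name"], *user_ctx["roles"]}
    is_admin = "_admin" in user_ctx["roles"]
    access = new_doc.get("_access")

    if access is None and is_admin:
        # Optional special case: admin-written docs without _access
        # are treated as having only the _admin role.
        return ["_admin"]
    if not isinstance(access, list) or not access:
        raise PermissionError("_access must be an array with at least one member")
    if is_admin:
        return access
    if old_doc is not None and not identities & set(old_doc["_access"]):
        # Updates and deletes: the existing doc must already list the user.
        raise PermissionError("no access to the existing document")
    if not identities & set(access):
        raise PermissionError("_access must include your name or one of your roles")
    return access


user_a = {"name": "user_a", "roles": []}
print(check_write(user_a, {"a": 1, "_access": ["user_a"]}))
```

A doc from user_a with no _access field, an empty array, or only other users' names would be rejected under these rules.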
>>> 
>>> * * *
>>> 
>>> 
>>> ## _revs_diff
>>> 
>>> /db/_revs_diff can answer the question of which revisions of a document do 
>>> NOT exist on a replication target: 
>>> http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
>>> 
>>> This would allow users to specify ids and rev(s) for docs they don’t have 
>>> access to (anymore), so the result schema should be expanded to handle id: 
>>> unauthorized or somesuch, something the replicator needs to know what to do 
>>> with, if it encounters it (say a user got removed from the _access list 
>>> in between the replicator opening _changes and requesting the doc).
>>> 
>>> The _revs_diff implementation would have to be altered to send an unauthorized 
>>> token for each doc the requesting userCtx has no access to. If we can 
>>> re-use some of our existing indexes, or any other performance optimisation, 
>>> that’d be great. I haven’t looked at that code at all, yet.
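
A rough Python sketch of the altered behaviour. The `{"missing": [...]}` shape follows the existing _revs_diff response; the `{"error": "unauthorized"}` entry is a hypothetical token, since the proposal only says "unauthorized or somesuch". The in-memory `stored` dict stands in for the real indexes.

```python
def revs_diff(request, stored, user_ctx):
    """request: {doc_id: [rev, ...]}
    stored:  {doc_id: {"revs": set_of_revs, "access": list}}
    """
    identities = {user_ctx["name"], *user_ctx["roles"]}
    result = {}
    for doc_id, revs in request.items():
        doc = stored.get(doc_id)
        if doc is None:
            # Unknown doc: every requested rev is missing.
            result[doc_id] = {"missing": list(revs)}
        elif "_admin" not in identities and not identities & set(doc["access"]):
            # Hypothetical marker the replicator would need to understand.
            result[doc_id] = {"error": "unauthorized"}
        else:
            missing = [rev for rev in revs if rev not in doc["revs"]]
            if missing:
                result[doc_id] = {"missing": missing}
    return result


stored = {"a": {"revs": {"1-x"}, "access": ["user_a"]},
          "b": {"revs": {"1-y"}, "access": ["user_b"]}}
print(revs_diff({"a": ["1-x", "2-z"], "b": ["1-y"]}, stored,
                {"name": "user_a", "roles": []}))
```

Note that answering "unauthorized" confirms the doc id exists, which is a separate trade-off from the rev-id guessing concern discussed next.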
>>> 
>>> An important side-effect of this is, once a user has been added to a doc’s 
>>> _access list, they get access to “the full history of the doc”, even before 
>>> they had access. Of course, in CouchDB this means only getting access to 
>>> the rev ids, and not the content, but since they are content-addressable 
>>> hashes, a user could brute-force themselves into revealing certain real 
>>> values from earlier incarnations of the doc. I’d rather not track _access 
>>> per document revision in perpetuity, so this is something we have to be 
>>> very up-front about.
>>> 
>>> * * *
>>> 
>>> 
>>> ## Partitioned Databases
>>> 
>>> I mentioned partitioned databases in my previous mail, and I think it is 
>>> something we can document that end-users can opt into, but doesn’t require 
>>> any special casing on the _access proposal. That is, if users start 
>>> prefixing their doc ids with a user name or id and enable both _access and 
>>> partitions, then they get all the benefits of a partitioned database, and 
>>> if they choose not to, they don’t, but things keep working. There are 
>>> enough use-cases to warrant both behaviours.
>>> 
>>> * * *
>>> 
>>> 
>>> ## Scenarios that _access should help with.
>>> 
>>> Overall, we developed _access to allow users to stop using the db-per-user 
>>> architecture, but once we have per-doc-access control, folks might start 
>>> using this for all manner of things. We should be clear about which 
>>> scenarios we support and which we don’t.
>>> 
>>> 
>>> ### Scenario 1: db-per-user
>>> 
>>> Before _access enabled databases, the only way to allow mutually 
>>> untrusting users to store data in a part of CouchDB that only they (and 
>>> admins) have access to was giving each user their own database.
>>> 
>>> In an _access enabled database, users can 
>>> CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user 
>>> (aside from admins) can access those docs.
>>> 
>>> This is the simplest scenario, as all we’d have to do is track the owner of a 
>>> document and produce by-access-id/seq indexes based on that owner.
>>> 
>>> The current prototype implementation mostly reflects this stage. Not saying 
>>> this is what we should ship, but it is the easiest to implement and explain.
>>> 
>>> Aside, I might be able to be persuaded to ship this as a 2.x feature, to 
>>> help those folks who don’t need anything else.
>>> 
>>> 
>>> ### Scenario 2: db-per-user + Sharing
>> 
>> One scenario we should address is how stopping sharing would work when 
>> documents are continuously replicated, e.g. to a client for offline usage. 
>> My understanding is that the person whose access to documents got 
>> revoked does not get a _changes update telling them that their access got 
>> removed; it would be up to the application developer to implement some kind 
>> of "notification" meta documents. Unless you have a better idea?
> 
> Since we now have a purge API as well, we could treat an un-share as a purge 
> for clients, and they can decide what to do with it.
> 
> Alternatively, we need to make breaking changes to _changes feed, maybe we 
> can hide that behind an opt-in flag, like “/db/_changes?access=true”, and 
> then we can send new rows like:
> 
> {seq: XYZ, id: abc, rev:4-YYY, _revoked: true} or somesuch.
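
A small Python sketch of how such a row could be chosen for one reader. It assumes the server can compare the doc's previous and new _access lists at update time, which the thread does not establish. `changes_row` is a hypothetical helper, and `_revoked` is the field name suggested above.

```python
def changes_row(seq, doc_id, rev, old_access, new_access, user_ctx):
    """Return the _changes row this user should see for one doc update."""
    identities = {user_ctx["name"], *user_ctx["roles"]}
    had_access = bool(identities & set(old_access))
    has_access = bool(identities & set(new_access))
    if has_access:
        # Regular row, in the shape _changes already uses.
        return {"seq": seq, "id": doc_id, "changes": [{"rev": rev}]}
    if had_access:
        # Opt-in row telling the client that its access was removed.
        return {"seq": seq, "id": doc_id, "rev": rev, "_revoked": True}
    return None  # never had access: nothing to report


user_b = {"name": "user_b", "roles": []}
print(changes_row(5, "abc", "4-YYY", ["user_a", "user_b"], ["user_a"], user_b))
```

A client that sees the `_revoked` row can then decide whether to delete or keep its local copy, much like the purge-style handling mentioned above.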
> 
> 
>> 
>>> 
>>> The second we allow per doc auth, users will want to share those docs with 
>>> other users. That’s why we initially suggested the _access field be an 
>>> array, so other users and groups can be specified to have access. There are 
>>> multiple scenarios in this one alone:
>>> 
>>> #### 2.1: The Todo List
>>> 
>>> In this scenario, a user has a reasonable amount of ”personal data” that 
>>> they want to selectively share with one or more other users.
>>> 
>>> #### 2.2: The Chat/Forum/Newsgroup
>>> 
>>> In this scenario, a user wants to share any number of documents with a 
>>> reasonable number of groups. However, since we need to limit the number of 
>>> groups a user belongs to (currently 10, see below for details), this might 
>>> actually not be a great solution. Or folks couldn’t be in more than 10 chat 
>>> groups at a time.
>>> 
>>> #### 2.3: The Corporate Hierarchy
>>> 
>>> In this scenario, users want to share any number of docs with a reasonable 
>>> number of groups in a top-down/bottom-up fashion. Think CEO shares with 
>>> executives, execs share with divisions, divisions report up to their one 
>>> executive, etc.
>>> 
>>> 
>>> ### Scenario 3: Multiple Apps
>>> 
>>> The preceding scenarios all assume that a single application is responsible 
>>> for everything. However, once we allow mutually distrusting users into a 
>>> single database *and* make each per-user slice work (almost) like a full 
>>> standalone CouchDB database, what would stop users from using this for a 
>>> multi-homing feature, where different applications are used for each user 
>>> in the same database?
>>> 
>>> I’ll be referring to these scenarios down the line.
>>> 
>>> * * *
>>> 
>>> 
>>> ## Design Docs
>>> 
>>> ### Admin
>>> 
>>> One of the downsides of db-per-user is managing design docs in the face of 
>>> a changing application, that is, how to distribute new design docs across 
>>> 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all 
>>> scenarios above but scenario 3., we could simplify this significantly. Say 
>>> an admin creates a design doc, and gives all users in the db access to this 
>>> design doc (this could be with the _users role, or yet another new role 
>>> _members, if we need it), requesting the result of a view defined in that 
>>> design doc will produce an index that is powered by the requesting user’s 
>>> by-access-seq index section(s).
>>> 
>>> N.B., this would require us to change a fundamental assumption when doing 
>>> the association between a design doc’s definition and index: normally, 
>>> there is only the `views` member that is hashed and that hash is used as 
>>> the index’s filename. Because there is only by-seq to power a view, that 
>>> all works. But now that we have an arbitrary set of sections on 
>>> by-access-seq, any view index built will have to take a user’s name and 
>>> roles into account. When a user leaves a group, or gains a group, all 
>>> indexes for that user will no longer be valid and need rebuilding.
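
To make the changed assumption concrete, here is a simplified Python sketch of an index signature that covers the user's name and roles as well as `views`. It is not how CouchDB computes view signatures; the JSON encoding, the SHA-256 hash and the function name are assumptions for illustration only.

```python
import hashlib
import json


def index_signature(views, user_ctx):
    """Hash the view definitions together with the user's name and roles."""
    payload = json.dumps(
        {
            "views": views,
            "name": user_ctx["name"],
            # Sort roles so that their order does not change the signature.
            "roles": sorted(user_ctx["roles"]),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


views = {"by_date": {"map": "function (doc) { emit(doc.date); }"}}
before = index_signature(views, {"name": "user_a", "roles": ["team_x", "team_y"]})
after = index_signature(views, {"name": "user_a", "roles": ["team_x"]})
print(before != after)  # leaving a group yields a different index
```

Because the signature changes whenever roles change, the old index file is simply no longer referenced, which is where the orphaned-index cleanup question comes from.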
>>> 
>>> 
>>> ### User
>>> 
>>> In any of the scenarios above, but especially 3., there could be legitimate 
>>> per-user design docs, so how should those be treated in an _access enabled 
>>> database?
>>> 
>>> The significant fields in a design doc are `views`, `validate_doc_update` 
>>> and `filters` (I’ll skip over the deprecated _show, _list, and _update).
>>> 
>>> The easiest to handle is a `filters`: if a user specifies a filter for a 
>>> _changes request or replication that lives in a design doc they don’t have 
>>> access to, they get an error, similar to if they specify a non-existent 
>>> design doc, just with `unauthorized` instead of `not_found`.
>>> 
>>> Next `views` is also not very hard to imagine working: just like globally 
>>> defined views for that db, the index is built for each user based on the 
>>> user’s name and roles.
>>> 
>>> More troubling are `validate_doc_update` functions: One, they are already 
>>> troubling in that they slow down any document updates. Two, if we now 
>>> import an existing db-per-user scenario where each user has their own 
>>> design docs,
>> 
>> I can’t think of a db-per-user scenario where each user DB would have a 
>> different validate_doc_update method? It would be the same method with 
>> access to the user context, the DBs security setting and the document, so it 
>> would act differently for different users, but using the same code.
> 
> They wouldn’t be different, but if we were to replicate 1000 db-per-user 
> design docs into a single database, as per today’s semantics, we’d have to 
> run 1000 VDUs on each doc update.
> 
>> 
>>> how should we apply validate_doc_update functions? 10s of 1000s of VDUs are 
>>> impractical to apply on each doc update, let alone just the management of 
>>> VDUs that are active on a database. One option would be to ignore VDUs if 
>>> they are not defined globally (say with a _members role). But especially in 
>>> scenario 3. this becomes problematic, but even without that specific 
>>> scenario, this violates the no surprises best practice.
>>> 
>>> We could say:
>>> 
>>> a) we don’t support scenario 3.
>> 
>> +1, I think it would make our lives easier in general if we don’t recommend 
>> to share the same CouchDB for multiple apps. At least I don’t see a reason 
>> to do that at this point.
> 
> I think I like this best, too, but I’d like to hear from others as well.
> 
> 
> Best
> Jan
> —
>> 
>>> b) we find a complicated but efficient way to apply only those VDUs that 
>>> are defined in design docs the writing user has access to plus any global 
>>> ones (this would be neat but rather complicated and potentially still 
>>> impractical from a performance perspective for N users).
>>> c) we could store all per-user design docs, but ignore them completely, 
>>> VDUs, views and filters.
>>> 
>>> I think I currently fall on the side of not supporting scenario 3. and 
>>> asking folks who migrate db-per-user to de-duplicate design docs and keep 
>>> them per-app. I believe that is a good trade-off between the most common 
>>> scenarios for db-per-user while keeping the implementation manageable. 
>>> Globally accessible design docs would show up in a user’s changes feed and 
>>> would replicate down to say a PouchDB application which might be the 
>>> exclusive user of those design docs.
>>> 
>>> In practice this would mean, a document that has an _id that starts with 
>>> _design/ will have to be produced by a database admin. Luckily, that’s 
>>> already the case. We should just make sure that folks don’t give db-admin 
>>> access to all users habitually.
>>> 
>>> 
>>> ## Read and Write Access
>>> 
>>> Speaking of validate_doc_update, it is used for two things: checking 
>>> document schema and doc update authorisation.
>>> 
>>> Once we allow access to a document with an _access field, we need to decide 
>>> what kind of access this gives to a doc: read-only or read-write (I’m not 
>>> considering write-only because for anything but doc creations this is not 
>>> useful as you need access to the current _rev).
>>> 
>>> However, when we look at implementing an application on top of our existing 
>>> API, it is already weird that read access can be controlled globally (or 
>>> with _access on a per doc level), but write access requires writing 
>>> JavaScript code. I think it would be a reasonable expectation for users to 
>>> expect a per-doc read/write permission granting.
>> 
>> Yes!
>> 
>>> 
>>> So we could have all of the above, but with two extra fields: _access_read 
>>> and _access_write, or _access: {read: [], write: []}
>> 
>> I prefer this API for its compactness, thinking about offline 
>> synchronization. The smaller the docs, the better.
>> 
>> Best
>> “Gregor”
>> —
>> 
>> 
>>> or we overload user and group names: _access: [user_a:read, user_b:write] 
>>> (or any permutation thereof). Overloading can cause trouble with naturally 
>>> occurring characters in group names.
>>> 
>>> The former seems more explicit, but from an API perspective that’s a little 
>>> more awkward: remember that we currently have an arbitrary limit of 10 
>>> members in a user’s role array, to avoid excessive fan out on 
>>> cluster-internal operations. Partitioned dbs could get away with more, more 
>>> easily however. If we allow the specification of access control in two 
>>> lists, and one of the lists implies membership in the other, we have a 
>>> total limit of 10 members across both arrays. Or we limit 5 + 5, but that 
>>> seems excessive, while 10 total seems weird, but doable. Anyway, good 
>>> bikeshed.
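
A Python sketch of the `_access: {read: [], write: []}` variant, assuming that write implies read and that the limit is 10 entries in total across both arrays. Both helper names are hypothetical and the limit is the "10 total" option, not a settled decision.

```python
MAX_ENTRIES = 10  # total across read and write, per the "10 total" option


def validate_access(access):
    """Raise ValueError if the two lists together exceed the limit."""
    total = len(access.get("read", [])) + len(access.get("write", []))
    if total == 0:
        raise ValueError("_access needs at least one member")
    if total > MAX_ENTRIES:
        raise ValueError("_access may hold at most %d entries in total" % MAX_ENTRIES)


def can(user_ctx, access, action):
    """action is "read" or "write"; writers are implicitly readers."""
    if action not in ("read", "write"):
        raise ValueError("unknown action: %r" % action)
    identities = {user_ctx["name"], *user_ctx["roles"]}
    allowed = set(access.get("write", []))
    if action == "read":
        allowed |= set(access.get("read", []))
    return bool(identities & allowed)


access = {"read": ["user_b"], "write": ["user_a"]}
validate_access(access)
print(can({"name": "user_b", "roles": []}, access, "read"))
print(can({"name": "user_b", "roles": []}, access, "write"))
```

Since writers never need to appear in `read`, the shared budget of 10 is not spent twice on the same user or group.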
>>> 
>>> 
>>> * * * 
>>> 
>>> 
>>> So far, I think all of the problems outlined are solvable, if with a clear 
>>> definition of what use-cases we do not support with access. If you have 
>>> more scenarios than the ones I outlined, please add them and we can see if 
>>> they cause any additional trouble.
>>> 
>>> Thanks for reading this far and I’m looking forward to your feedback.
>>> 
>>> 
>>> Best,
>>> Jan “_access” Lehnardt
>>> —
>>> 
>>> 
>>> 
>>> 
>>>> On 17. Feb 2019, at 15:25, Jan Lehnardt wrote:
>>>> 
>>>> Hi Everyone,
>>>> 
>>>> I’m happy to share my work in progress attempt to implement the per-doc 
>>>> access control feature we discussed a good while ago:
>>>> 
>>>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>>>> 
>>>> You can check out my branch here:
>>>> 
>>>> https://github.com/apache/couchdb/compare/access?expand=1
>>>> 
>>>> It is very much work in progress, but it is far enough along to warrant 
>>>> discussion.
>>>> 
>>>> The main point of this branch is to show all the places that we would need 
>>>> to change to support the proposal.
>>>> 
>>>> Things I’ve left for later:
>>>> 
>>>> - currently only the first element in the _access array is used. Our 
>>>> and/or syntax can be added later.
>>>> - building per-access views has not been implemented yet, couch_index 
>>>> would have to be taught about the new per-access-id index.
>>>> - pretty HTTP error handling
>>>> - tests except for a tiny shell script 😇
>>>> 
>>>> Implementation notes:
>>>> 
>>>> You create a database with the _access feature turned on like so:  PUT 
>>>> /db?access=true
>>>> 
>>>> I started out with storing _access in the document body, as that would 
>>>> allow for a minimal change set, however, on doc updates, we try hard not 
>>>> to load the old doc body from the database, and forcing us to do so for 
>>>> EVERY doc update under _access seemed prohibitive, so I extended the #doc, 
>>>> #doc_info and #full_doc_info records with a new `access` attribute that is 
>>>> stored in both by-id and by-seq. I will need guidance on how extending 
>>>> these records impacts multi-version cluster interop. And especially whether 
>>>> this is an acceptable approach.
>>>> 
>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>>>> 
>>>> * * *
>>>> 
>>>> The main addition is a new native query server called 
>>>> couch_access_native_proc, which implements two new indexes by-access-id 
>>>> and by-access-seq which do what you’d expect, pass in a userCtx and 
>>>> retrieve the equivalent of _all_docs or _changes, but only including those 
>>>> docs that match the username and roles in their _access property. The 
>>>> existing handlers for _all_docs and _changes have been augmented to use 
>>>> the new indexes instead of the default ones, unless the user is an admin.
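
Conceptually, the by-access-id lookup behaves like the Python sketch below. It is an in-memory illustration, not the Erlang implementation in the branch. It also indexes every entry of _access, whereas the prototype currently uses only the first element, and it leaves out admins, who use the default index.

```python
from collections import defaultdict


def build_by_access_id(docs):
    """Map each name or role in _access to the ids of docs it can see."""
    index = defaultdict(list)
    for doc in docs:
        # Docs with no _access are admin-only for now.
        for entry in doc.get("_access") or ["_admin"]:
            index[entry].append(doc["_id"])
    return index


def all_docs_for(index, user_ctx):
    """Equivalent of _all_docs restricted to what this userCtx may see."""
    ids = set()
    for identity in [user_ctx["name"], *user_ctx["roles"]]:
        ids.update(index.get(identity, []))
    return sorted(ids)


docs = [
    {"_id": "a", "_access": ["user_a"]},
    {"_id": "b", "_access": ["user_b", "team"]},
    {"_id": "c"},
]
index = build_by_access_id(docs)
print(all_docs_for(index, {"name": "user_a", "roles": ["team"]}))
```

A by-access-seq index would follow the same idea, keyed by update sequence instead of doc id so that it can serve _changes.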
>>>> 
>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>>>> 
>>>> * * *
>>>> 
>>>> The rest of the diff is concerned with making document CRUD behave as 
>>>> you’d expect it. See this little demonstration for what things look like:
>>>> 
>>>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just 
>>>> noticing that there might be something wonky with DELETE, but you’ll get 
>>>> the gist #rimshot)
>>>> 
>>>> * * *
>>>> 
>>>> Open questions:
>>>> 
>>>> - The aim of this is to get as close to regular CouchDB behaviour as 
>>>> possible. One thing that is new, however, and would require all apps to 
>>>> be changed, is that docs in an _access enabled database need to include 
>>>> an _access field (docs with no _access are admin-only for now). We 
>>>> might want to consider auto-inserting the authenticated user’s name as 
>>>> the first element in the _access array on new document writes, so 
>>>> existing apps “just work”.
>>>> 
>>>> - Interplay with partitioned dbs: eschewing db-per-user is already a large 
>>>> boon if you have a lot of users, but making those per-user requests inside 
>>>> an _access enabled database efficient would be doubly nice, so why not use 
>>>> the username from the first question above and use that as the partition 
>>>> key? This would work nicely for natural users with their own docs that 
>>>> want to share them with others later, but I can easily imagine a pipelined 
>>>> use of CouchDB, where a “collector” user creates all new docs, an 
>>>> “analyser” takes them over and hands them to a “result” user for viewing. 
>>>> In that case, we’d violate the high-cardinality rule of partitions (have a 
>>>> lot of small ones), instead all docs go through all three users. I’d be 
>>>> okay with treating the latter scenario as a minor use-case, but for that 
>>>> use-case, we should be able to disable auto-partitioning on db creation.
>>>> 
>>>> - building access view indexes for docs that have frequent _access 
>>>> changes leads to many orphaned view indexes; we should look at an 
>>>> auto-cleanup solution here (maybe keep 1-N indexes in case folks just swap 
>>>> back and forth).
>>>> 
>>>> * * *
>>>> 
>>>> I’ll leave this here for now, I’m sure there are a few more things to 
>>>> consider.
>>>> 
>>>> I’d love to hear any and all feedback you might have. Especially if 
>>>> anything is unclear.
>>>> 
>>>> Best
>>>> Jan
>>>> —
>>> 
>>> -- 
>>> Professional Support for Apache CouchDB:
>>> https://neighbourhood.ie/couchdb-support/
>>> 
