Thanks for your initial comments.

> On 3. Apr 2019, at 23:07, Adam Kocoloski <kocol...@apache.org> wrote:
>
> I’m also in favor of dropping Scenario 3.
>
> One topic we may have discussed in the past but I wanted to close out here:
> in the relational database world it’s not uncommon to use materialized views
> as an access control mechanism to selectively expose contents of a table to
> clients who cannot access the table directly. Does the current thinking on
> _access for views support that use case? Can we build a view using a set of
> roles inherited from the user who created the design doc, but then turn
> around and set the _access on the view itself to a less-restrictive set?
3 minutes thinking it over didn’t reveal any particular problems with this feature, aside from include_docs not working as expected, which might be an okay trade-off for now. It could be included later.

> On the _revs_diff topic: I’m not all that concerned about users trying to
> guess revision IDs that exist on the server, and then reverse-engineer the
> contents of the existing revisions. Maybe I ought to be.

I’m not particularly worried, but it is at least a theoretical situation where our users can be caught with their pants down when they didn’t expect it. All I want is to make sure we document this properly. Cf. git, where if you get access to a repo, you get the whole history, not just the state from where you started having access.

Best
Jan
--

>
> On a somewhat-related note, I have had conversations before with folks who
> are keen to adopt these sorts of fine-grained access control systems who said
> they actually prefer to have a 403 Forbidden response list the set of
> privileges that would be sufficient to access the resource. I found this
> surprising, but I guess it comes down to a user needing to figure out what
> kind of security exception to apply for in order to make progress with some
> data analysis. I think this is a topic on which we could make a fairly
> late-binding decision, or even have it as a configurable option.
>
> I could definitely see the base Scenario 1 (single _access labels) landing
> ahead of the more-complex sharing models.
>
> I haven’t had a chance to take a deep look at the code but the design seems
> good and thoughtful, and I definitely like the focus on the use cases.
>
> Adam
>
>> On Mar 14, 2019, at 11:21 AM, Jan Lehnardt <j...@apache.org> wrote:
>>
>> My replies now inline.
>>
>>> On 14. Mar 2019, at 16:13, Jan Lehnardt <j...@apache.org> wrote:
>>>
>>> I received some notes privately from Gregor Martynus, which I’m reproducing
>>> here in email thread form.
>>> This email is all Gregor’s notes; my next email
>>> is my replies to them.
>>>
>>>> On 10. Mar 2019, at 15:51, Jan Lehnardt <j...@apache.org> wrote:
>>>>
>>>> Hey all,
>>>>
>>>> after mulling this over some more, I’d like to tackle the detailed API and
>>>> behaviour for this, especially how _access works in conjunction with
>>>> existing access control features.
>>>>
>>>> My guiding principles so far are:
>>>>
>>>> 1. Make the API intuitive: things should work the way they look like they
>>>> should work.
>>>> 2. The default should never be that a resource is accidentally left
>>>> accessible to the public.
>>>> 3. This should work as a natural extension to the existing security
>>>> features*.
>>>>
>>>> * I’d be up for reworking the whole lot, too, but that might be a better
>>>> discussion for > 4.0.
>>>>
>>>>
>>>> ## Database Creation and Default Behaviours
>>>>
>>>> Creating a database with _access features is, as mentioned before, done via
>>>> a flag to PUT /database?access=true
>>>>
>>>> In a 3.0 world where this would land, we already agreed that databases
>>>> should be admin-only by default (instead of world read/writeable today).
>>>> This is a sensible default, but that leaves us with an _access enabled
>>>> database that can’t be used by anyone but server or db admins. Not very
>>>> useful.
>>>>
>>>> To allow arbitrary users to use the db, I suggest we use the existing
>>>> _security system: i.e. if a user or a group a user belongs to is mentioned
>>>> in either `admins` or `members` inside of _security, they can proceed and
>>>> create documents on the db. This puts a second step burden on the
>>>> application developer, but it slots cleanly into the existing security
>>>> mechanisms, and doesn’t require special case handling. Alternatively, we
>>>> could define that _security isn’t available in _access enabled databases,
>>>> but that’s something I’d like to avoid if at all possible.
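The `_security` rule proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CouchDB code: the function name `may_create_docs` is hypothetical, and the only things taken from CouchDB are the shape of the `_security` object (`admins` and `members`, each with `names` and `roles`) and of the userCtx (`name`, `roles`).

```python
def may_create_docs(user_ctx, security):
    """Return True if the user is named, or holds a role listed, in _security.

    Hypothetical helper: mirrors the proposal that a user mentioned in either
    `admins` or `members` may create documents. Server admins are not handled
    here because they have full access anyway.
    """
    name = user_ctx.get("name")
    roles = set(user_ctx.get("roles", []))
    for section in ("admins", "members"):
        entry = security.get(section, {})
        if name is not None and name in entry.get("names", []):
            return True
        if roles & set(entry.get("roles", [])):
            return True
    # Admin-only default: an empty _security admits no regular user.
    return False


security = {
    "admins": {"names": ["root_a"], "roles": []},
    "members": {"names": ["user_a"], "roles": ["editors"]},
}
print(may_create_docs({"name": "user_a", "roles": []}, security))
print(may_create_docs({"name": "user_b", "roles": ["editors"]}, security))
print(may_create_docs({"name": "user_c", "roles": []}, security))
```

The last call shows the second-step burden mentioned above: until the application developer adds a user or one of their roles to `_security`, that user cannot write to the database at all.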
>>>>
>>>> In order to make it easy to specify that “everyone in _users” should be
>>>> able to use the db, I suggest we add a new role `_users` that is valid
>>>> inside _security, which means “everyone in /_users” (this only excludes
>>>> server admins which have full access anyway).
>>>>
>>>> * * *
>>>>
>>>>
>>>> ## Document Creation and Access Control
>>>>
>>>> Next, one of our non-admin users creates a doc. There are multiple options
>>>> as to how we store the _access information.
>>>>
>>>> 1. Automatically translate the userCtx.name of a doc creation (not an
>>>> update) into the first element of the _access array. E.g. user_a PUT
>>>> /db/doc {"a":1} creates this doc: {"a":1,"_access":["user_a"]}. This is a
>>>> little bit counter-intuitive.
>>>>
>>>> 2. We require that a user puts "_access":["user_a"] in themselves. This is
>>>> an explicit granting of access permissions on doc creation and I think is
>>>> preferable.
>>>
>>> I prefer being explicit.
>>>
>>>
>>>>
>>>> This leaves the edge case of docs that have no _access member: so far I
>>>> thought those docs are admin-only, with maybe a db-wide option to swap the
>>>> default to public access, but I think given the explicitness of 2. we can
>>>> do better: require _access for all new doc creations in access-enabled
>>>> databases. A user cannot create a new document without an _access field
>>>> that is an array that has at least one member. For public documents, we
>>>> could invent a new role _public, and admin-only docs could use the
>>>> existing role _admin.
>>>>
>>>> The one downside to this approach is that we won’t be able to replicate
>>>> existing databases into an access-enabled database without modifying all
>>>> documents. This might be a worthwhile trade-off, but we should make that
>>>> decision consciously and document it well.
>>>
>>> We could also provide tooling for migrations?
>>
>> I’d love tooling, but we’d have to make sure we can do it correctly for a
>> large number of use-cases.
>> For the acceptance of this change, I’d make
>> “documenting a migration path for db-per-user setups” a MUST have, and any
>> code that helps with that a nice to have.
>>
>>>
>>>
>>>> We could allow for a special case where an _admin user can create docs
>>>> that have no _access field, and those docs are treated as having only the
>>>> _admin role in _access. So at least we could replicate all data in, but
>>>> then require a manual step to update all docs to, say, migrate an existing
>>>> db-per-user app, while not accidentally exposing any docs to folks that
>>>> shouldn’t read them.
>>>>
>>>> For the rest of cRUD, the existing document must store one of the RUD-ing
>>>> user’s name or role in its _access field.
>>>>
>>>> For both creations and updates, a user MUST supply at least one role they
>>>> belong to or their own username.
>>>>
>>>> * * *
>>>>
>>>>
>>>> ## _revs_diff
>>>>
>>>> /db/_revs_diff can answer the question of which revisions of a document do
>>>> NOT exist on a replication target:
>>>> http://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-diff
>>>>
>>>> This would allow users to specify ids and rev(s) for docs they don’t have
>>>> access to (anymore), so the result schema should be expanded to handle
>>>> id: unauthorized or somesuch, something the replicator needs to know what
>>>> to do with, if it encounters it (say a user got removed from the _access
>>>> list in between the replicator opening _changes and requesting the doc).
>>>>
>>>> The _revs_diff implementation would have to be altered to send an
>>>> unauthorized token for each doc the requesting userCtx has no access to.
>>>> If we can re-use some of our existing indexes, or any other performance
>>>> optimisation, that’d be great. I haven’t looked at that code at all, yet.
>>>>
>>>> An important side-effect of this is, once a user has been added to a doc’s
>>>> _access list, they get access to “the full history of the doc”, even
>>>> from before they had access.
>>>> Of course, in CouchDB this means only getting
>>>> access to the rev ids, and not the content, but since they are
>>>> content-addressable hashes, a user could brute-force themselves into
>>>> revealing certain real values from earlier incarnations of the doc. I’d
>>>> rather not track _access per document revision in perpetuity, so this is
>>>> something we have to be very up-front about.
>>>>
>>>> * * *
>>>>
>>>>
>>>> ## Partitioned Databases
>>>>
>>>> I mentioned partitioned databases in my previous mail, and I think it is
>>>> something we can document that end-users can opt into, but doesn’t require
>>>> any special casing on the _access proposal. That is, if users start
>>>> prefixing their doc ids with a user name or id and enable both _access and
>>>> partitions, then they get all the benefits of a partitioned database, and
>>>> if they choose not to, they don’t, but things keep working. There are
>>>> enough use-cases to warrant both behaviours.
>>>>
>>>> * * *
>>>>
>>>>
>>>> ## Scenarios that _access should help with.
>>>>
>>>> Overall, we developed _access to allow users to stop using the db-per-user
>>>> architecture, but once we have per-doc-access control, folks might start
>>>> using this for all manner of things. We should be clear about which
>>>> scenarios we support and which we don’t.
>>>>
>>>>
>>>> ### Scenario 1: db-per-user
>>>>
>>>> In this scenario, before _access enabled databases, the only way to allow
>>>> mutually untrusting users to store data in a part of CouchDB that only
>>>> they (and admins) have access to was giving each user their own database.
>>>>
>>>> In an _access enabled database, users can
>>>> CRUD/_changes/_all_docs/_revs_diff their own docs knowing no other user
>>>> (aside from admins) can access those docs.
>>>>
>>>> This is the simplest scenario, as all we’d have to do is track the owner of
>>>> a document and produce by-access-id/seq indexes based on that owner.
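To make the owner-based index in Scenario 1 concrete, here is a small Python sketch. It is an in-memory illustration, not the prototype's on-disk format: `build_by_access_seq` and `changes_for` are hypothetical names, the owner is assumed to be the first entry of `_access`, and docs without `_access` are assumed to fall under `_admin`.

```python
from collections import defaultdict


def build_by_access_seq(changes):
    """Group (seq, doc_id) pairs by document owner.

    `changes` is an iterable of (seq, doc) pairs in update order. The owner is
    assumed to be the first element of the doc's _access array.
    """
    index = defaultdict(list)
    for seq, doc in changes:
        access = doc.get("_access") or ["_admin"]  # no _access: admin-only
        index[access[0]].append((seq, doc["_id"]))
    return index


def changes_for(index, owner):
    """Return the per-owner slice of the changes feed, oldest first."""
    return list(index.get(owner, []))


changes = [
    (1, {"_id": "a", "_access": ["user_a"]}),
    (2, {"_id": "b", "_access": ["user_b"]}),
    (3, {"_id": "c", "_access": ["user_a"]}),
    (4, {"_id": "d"}),
]
index = build_by_access_seq(changes)
print(changes_for(index, "user_a"))
print(changes_for(index, "user_b"))
```

Each owner only ever reads their own section of the index, which is what would let a single database stand in for many per-user databases in this scenario.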
>>>>
>>>> The current prototype implementation mostly reflects this stage. Not
>>>> saying this is what we should ship, but it is the easiest to implement and
>>>> explain.
>>>>
>>>> As an aside, I might be persuaded to ship this as a 2.x feature, to
>>>> help those folks who don’t need anything else.
>>>>
>>>>
>>>> ### Scenario 2: db-per-user + Sharing
>>>
>>> One scenario we should address is how stopping to share would work when
>>> documents are continuously replicated, e.g. to a client for offline usage.
>>> My understanding is that the person whose access to documents got
>>> revoked does not get a _changes update telling them that their access got
>>> removed; it would be up to the application developer to implement some kind
>>> of "notification" meta documents. Unless you have a better idea?
>>
>> Since we now have a purge API as well, we could treat an un-share as a purge
>> for clients, and they can decide what to do with it.
>>
>> Alternatively, we need to make breaking changes to _changes feed, maybe we
>> can hide that behind an opt-in flag, like “/db/_changes?access=true”, and
>> then we can send new rows like:
>>
>> {seq: XYZ, id: abc, rev:4-YYY, _revoked: true} or somesuch.
>>
>>
>>>
>>>>
>>>> The second we allow per doc auth, users will want to share those docs with
>>>> other users. That’s why we initially suggested the _access field be an
>>>> array, so other users and groups can be specified to have access. There
>>>> are multiple scenarios in this one alone:
>>>>
>>>> #### 2.1: The Todo List
>>>>
>>>> In this scenario, a user has a reasonable amount of “personal data” that
>>>> they want to selectively share with one or more other users.
>>>>
>>>> #### 2.2: The Chat/Forum/Newsgroup
>>>>
>>>> In this scenario, a user wants to share any number of documents with a
>>>> reasonable number of groups.
>>>> However, since we need to limit the number of
>>>> groups a user belongs to (currently 10, see below for details), this might
>>>> actually not be a great solution. Or folks couldn’t be in more than 10
>>>> chat groups at a time.
>>>>
>>>> #### 2.3: The Corporate Hierarchy
>>>>
>>>> In this scenario, users want to share any number of docs with a reasonable
>>>> number of groups in a top-down/bottom-up fashion. Think CEO shares with
>>>> executives, execs share with divisions, divisions report up to their one
>>>> executive, etc.
>>>>
>>>>
>>>> ### 3: Multiple Apps
>>>>
>>>> The preceding scenarios all assume that a single application is
>>>> responsible for everything. However, once we allow mutually distrusting
>>>> users into a single database *and* make each per-user slice work (almost)
>>>> like a full standalone CouchDB database, what would stop users from using
>>>> this for a multi-homing feature, where different applications are used for
>>>> each user in the same database?
>>>>
>>>> I’ll be referring to these scenarios down the line.
>>>>
>>>> * * *
>>>>
>>>>
>>>> ## Design Docs
>>>>
>>>> ### Admin
>>>>
>>>> One of the downsides of db-per-user is managing design docs in the face of
>>>> a changing application, that is, how to distribute new design docs across
>>>> 10s of 1000+s of user dbs? It’s not impossible, but tedious. In all
>>>> scenarios above but scenario 3., we could simplify this significantly. Say
>>>> an admin creates a design doc, and gives all users in the db access to
>>>> this design doc (this could be with the _users role, or yet another new
>>>> role _members, if we need it), requesting the result of a view defined in
>>>> that design doc will produce an index that is powered by the requesting
>>>> user’s by-access-seq index section(s).
>>>>
>>>> N.B., this would require us to change a fundamental assumption when doing
>>>> the association between a design doc’s definition and index: normally,
>>>> there is only the `views` member that is hashed and that hash is used as
>>>> the index’s filename. Because there is only by-seq to power a view, that
>>>> all works. But now that we have an arbitrary set of sections on
>>>> by-access-seq, any view index built will have to take a user’s name and
>>>> roles into account. When a user leaves a group, or gains a group, all
>>>> indexes for that user will no longer be valid and need rebuilding.
>>>>
>>>>
>>>> ### User
>>>>
>>>> In any of the scenarios above, but especially 3., there could be
>>>> legitimate per-user design docs, so how should those be treated in an
>>>> _access enabled database?
>>>>
>>>> The significant fields in a design doc are `views`, `validate_doc_update`
>>>> and `filters` (I’ll skip over the deprecated _show, _list, and _update).
>>>>
>>>> The easiest to handle is a `filters`: if a user specifies a filter for a
>>>> _changes request or replication that lives in a design doc they don’t have
>>>> access to, they get an error, similar to if they specify a non-existent
>>>> design doc, just with `unauthorized` instead of `not_found`.
>>>>
>>>> Next `views` is also not very hard to imagine working: just like globally
>>>> defined views for that db, the index is built for each user based on the
>>>> user’s name and roles.
>>>>
>>>> More troubling are `validate_doc_update` functions: One, they are already
>>>> troubling in that they slow down any document updates. Two, if we now
>>>> import an existing db-per-user scenario where each user has their own
>>>> design docs,
>>>
>>> I can’t think of a db-per-user scenario where each user DB would have a
>>> different validate_doc_update method?
>>> It would be the same method with
>>> access to the user context, the DBs security setting and the document, so
>>> it would act differently for different users, but using the same code.
>>
>> They wouldn’t be different, but if we were to replicate 1000 db-per-user
>> design docs into a single database, as per today’s semantics, we’d have to
>> run 1000 VDUs on each doc update.
>>
>>>
>>>> how should we apply validate_doc_update functions? 10s of 1000s of VDUs
>>>> are impractical to apply on each doc update, let alone just the management
>>>> of VDUs that are active on a database. One option would be to ignore VDUs
>>>> if they are not defined globally (say with a _members role). But
>>>> especially in scenario 3. this becomes problematic, but even without that
>>>> specific scenario, this violates the no surprises best practice.
>>>>
>>>> We could say:
>>>>
>>>> a) we don’t support scenario 3.
>>>
>>> +1, I think it would make our lives easier in general if we don’t recommend
>>> to share the same CouchDB for multiple apps. At least I don’t see a reason
>>> to do that at this point.
>>
>> I think I like this best, too, but I’d like to hear from others as well.
>>
>>
>> Best
>> Jan
>> --
>>>
>>>> b) we find a complicated but efficient way to apply only those VDUs that
>>>> are defined in design docs the writing user has access to plus any global
>>>> ones (this would be neat but rather complicated and potentially still
>>>> impractical from a performance perspective for N users).
>>>> c) we could store all per-user design docs, but ignore them completely,
>>>> VDUs, views and filters.
>>>>
>>>> I think I currently fall on the side of not supporting scenario 3. and
>>>> asking folks who migrate db-per-user to de-duplicate design docs and keep
>>>> them per-app. I believe that is a good trade-off between the most common
>>>> scenarios for db-per-user while keeping the implementation manageable.
>>>> Globally accessible design docs would show up in a user’s changes feed and
>>>> would replicate down to say a PouchDB application which might be the
>>>> exclusive user of those design docs.
>>>>
>>>> In practice this would mean, a document that has an _id that starts with
>>>> _design/ will have to be produced by a database admin. Luckily, that’s
>>>> already the case. We should just make sure that folks don’t give db-admin
>>>> access to all users habitually.
>>>>
>>>>
>>>> ## Read and Write Access
>>>>
>>>> Speaking of validate_doc_update, it is used for two things: checking
>>>> document schema and doc update authorisation.
>>>>
>>>> Once we allow access to a document with an _access field, we need to
>>>> decide what kind of access this gives to a doc: read-only or read-write
>>>> (I’m not considering write-only because for anything but doc creations
>>>> this is not useful as you need access to the current _rev).
>>>>
>>>> However, when we look at implementing an application on top of our
>>>> existing API, it is already weird that read access can be controlled
>>>> globally (or with _access on a per doc level), but write access requires
>>>> writing JavaScript code. I think it would be a reasonable expectation for
>>>> users to expect a per-doc read/write permission granting.
>>>
>>> Yes!
>>>
>>>>
>>>> So we could have all of the above, but with two extra fields: _access_read
>>>> and _access_write, or _access: {read: [], write: []}
>>>
>>> I prefer this API for its compactness, thinking about offline
>>> synchronization. The smaller the docs, the better.
>>>
>>> Best
>>> “Gregor”
>>> --
>>>
>>>
>>>> or we overload user and group names: _access: [user_a:read, user_b:write]
>>>> (or any permutation thereof). Overloading can cause trouble with naturally
>>>> occurring characters in group names.
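The two API shapes discussed above can be compared with a short Python sketch. Everything here is illustrative: `can_read`, `can_write` and `parse_overloaded` are hypothetical names, and the sketch assumes that write access implies read access, which the thread has not settled.

```python
def principals(user_ctx):
    """The user's own name plus all of their roles."""
    return {user_ctx["name"], *user_ctx.get("roles", [])}


def can_write(user_ctx, doc):
    """Check the `write` list of an _access: {read: [], write: []} object."""
    return bool(principals(user_ctx) & set(doc["_access"].get("write", [])))


def can_read(user_ctx, doc):
    """Assumption: anyone who may write may also read."""
    readers = set(doc["_access"].get("read", []))
    return can_write(user_ctx, doc) or bool(principals(user_ctx) & readers)


def parse_overloaded(entry):
    """Split an overloaded entry such as 'user_a:read' at the last colon."""
    name, _, permission = entry.rpartition(":")
    return name, permission


doc = {"_id": "todo-1", "_access": {"read": ["team"], "write": ["user_a"]}}
reader = {"name": "user_b", "roles": ["team"]}
owner = {"name": "user_a", "roles": []}
print(can_read(reader, doc), can_write(reader, doc))
print(can_read(owner, doc), can_write(owner, doc))

# A role that is literally named "ops:read" cannot be told apart from the
# role "ops" with read permission once names and permissions share a string.
print(parse_overloaded("ops:read"))
```

The last line is the kind of trouble overloading invites: the separator character has to be banned from user and group names, or the suffix has to be mandatory on every entry.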
>>>>
>>>> The former seems more explicit, but from an API perspective that’s a
>>>> little more awkward: remember that we currently have an arbitrary limit of
>>>> 10 members in a user’s role array, to avoid excessive fan out on
>>>> cluster-internal operations. Partitioned dbs could get away with more,
>>>> more easily however. If we allow the specification of access control in
>>>> two lists, and one of the lists implies membership in the other, we have a
>>>> total limit of 10 members across both arrays. Or we limit 5 + 5, but that
>>>> seems excessive, while 10 total seems weird, but doable. Anyway, good
>>>> bikeshed.
>>>>
>>>>
>>>> * * *
>>>>
>>>>
>>>> So far, I think all of the problems outlined are solvable, if only with a
>>>> clear definition of which use-cases we do not support with _access. If you
>>>> have more scenarios than the ones I outlined, please add them and we can
>>>> see if they cause any additional trouble.
>>>>
>>>> Thanks for reading this far and I’m looking forward to your feedback.
>>>>
>>>>
>>>> Best,
>>>> Jan “_access” Lehnardt
>>>> --
>>>>
>>>>
>>>>
>>>>
>>>>> On 17. Feb 2019, at 15:25, Jan Lehnardt <j...@apache.org> wrote:
>>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> I’m happy to share my work in progress attempt to implement the per-doc
>>>>> access control feature we discussed a good while ago:
>>>>>
>>>>> https://lists.apache.org/thread.html/6aa77dd8e5974a3a540758c6902ccb509ab5a2e4802ecf4fd724a5e4@%3Cdev.couchdb.apache.org%3E
>>>>>
>>>>> You can check out my branch here:
>>>>>
>>>>> https://github.com/apache/couchdb/compare/access?expand=1
>>>>>
>>>>> It is very much work in progress, but it is far enough along to warrant
>>>>> discussion.
>>>>>
>>>>> The main point of this branch is to show all the places that we would
>>>>> need to change to support the proposal.
>>>>>
>>>>> Things I’ve left for later:
>>>>>
>>>>> - currently only the first element in the _access array is used. Our
>>>>> and/or syntax can be added later.
>>>>> - building per-access views has not been implemented yet, couch_index
>>>>> would have to be taught about the new per-access-id index.
>>>>> - pretty HTTP error handling
>>>>> - tests except for a tiny shell script 😇
>>>>>
>>>>> Implementation notes:
>>>>>
>>>>> You create a database with the _access feature turned on like so: PUT
>>>>> /db?access=true
>>>>>
>>>>> I started out with storing _access in the document body, as that would
>>>>> allow for a minimal change set, however, on doc updates, we try hard not
>>>>> to load the old doc body from the database, and forcing us to do so for
>>>>> EVERY doc update under _access seemed prohibitive, so I extended the
>>>>> #doc, #doc_info and #full_doc_info records with a new `access` attribute
>>>>> that is stored in both by-id and by-seq. I will need guidance on how
>>>>> extending these records impacts multi-version cluster interop. And
>>>>> especially whether this is an acceptable approach.
>>>>>
>>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-904ab7473ff8ddd07ea44aca414e3a36
>>>>>
>>>>> * * *
>>>>>
>>>>> The main addition is a new native query server called
>>>>> couch_access_native_proc, which implements two new indexes by-access-id
>>>>> and by-access-seq which do what you’d expect, pass in a userCtx and
>>>>> retrieve the equivalent of _all_docs or _changes, but only including
>>>>> those docs that match the username and roles in their _access property.
>>>>> The existing handlers for _all_docs and _changes have been augmented to
>>>>> use the new indexes instead of the default ones, unless the user is an
>>>>> admin.
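The lookup behaviour described above (pass in a userCtx, get back only the matching docs, admins bypass the filter) can be illustrated with a few lines of Python. This is a sketch of the semantics only, not of couch_access_native_proc, which is Erlang and index-backed; `all_docs_for` is a hypothetical name, and matching against every `_access` entry is an assumption, since the branch currently uses only the first one.

```python
def all_docs_for(user_ctx, docs):
    """Return the ids a user would see in an _all_docs-style listing.

    Admins (role `_admin`) see everything. Everyone else sees only docs whose
    _access array contains their name or one of their roles. Docs without
    _access are treated as admin-only.
    """
    roles = set(user_ctx.get("roles", []))
    if "_admin" in roles:
        return sorted(doc["_id"] for doc in docs)
    allowed = roles | {user_ctx["name"]}
    return sorted(
        doc["_id"] for doc in docs if allowed & set(doc.get("_access", []))
    )


docs = [
    {"_id": "a", "_access": ["user_a"]},
    {"_id": "b", "_access": ["user_b", "team"]},
    {"_id": "c"},
]
print(all_docs_for({"name": "user_a", "roles": ["team"]}, docs))
print(all_docs_for({"name": "user_b", "roles": []}, docs))
print(all_docs_for({"name": "root", "roles": ["_admin"]}, docs))
```

A linear scan like this would not scale, which is the point of keeping dedicated by-access-id and by-access-seq indexes rather than filtering the default ones on every request.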
>>>>>
>>>>> https://github.com/apache/couchdb/compare/access?expand=1&ws=0#diff-fbb53323f07579be5e46ba63cb6701c4
>>>>>
>>>>> * * *
>>>>>
>>>>> The rest of the diff is concerned with making document CRUD behave as
>>>>> you’d expect it. See this little demonstration for what things look like:
>>>>>
>>>>> https://gist.github.com/janl/b6d3f7502aa20b7b9ab9d9dcb8e92497 (I’m just
>>>>> noticing that there might be something wonky with DELETE, but you’ll get
>>>>> the gist #rimshot)
>>>>>
>>>>> * * *
>>>>>
>>>>> Open questions:
>>>>>
>>>>> - The aim of this is to get as close to regular CouchDB behaviour as
>>>>> possible. One thing that is new, however, and which would require all
>>>>> apps to be changed, is that in an _access enabled database they must
>>>>> include an _access field in their docs (docs with no _access are
>>>>> admin-only for now). We might want to consider on new document writes to
>>>>> auto-insert the authenticated user’s name as the first element in the
>>>>> _access array, so existing apps “just work”.
>>>>>
>>>>> - Interplay with partitioned dbs: eschewing db-per-user is already a
>>>>> large boon if you have a lot of users, but making those per-user requests
>>>>> inside an _access enabled database efficient would be doubly nice, so why
>>>>> not use the username from the first question above and use that as the
>>>>> partition key? This would work nicely for natural users with their own
>>>>> docs that want to share them with others later, but I can easily imagine
>>>>> a pipelined use of CouchDB, where a “collector” user creates all new
>>>>> docs, an “analyser” takes them over and hands them to a “result” user for
>>>>> viewing. In that case, we’d violate the high-cardinality rule of
>>>>> partitions (have a lot of small ones), instead all docs go through all
>>>>> three users.
>>>>> I’d be okay with treating the latter scenario as a minor
>>>>> use-case, but for that use-case, we should be able to disable
>>>>> auto-partitioning on db creation.
>>>>>
>>>>> - building access view indexes for docs that have frequent _access
>>>>> changes leads to many orphaned view indexes; we should look at an
>>>>> auto-cleanup solution here (maybe keep 1-N indexes in case folks just
>>>>> swap back and forth).
>>>>>
>>>>> * * *
>>>>>
>>>>> I’ll leave this here for now, I’m sure there are a few more things to
>>>>> consider.
>>>>>
>>>>> I’d love to hear any and all feedback you might have. Especially if
>>>>> anything is unclear.
>>>>>
>>>>> Best
>>>>> Jan
>>>>> --
>>>>
>>>> --
>>>> Professional Support for Apache CouchDB:
>>>> https://neighbourhood.ie/couchdb-support/