> On 27 Sep 2016, at 18:31, Johannes Jörg Schmidt <[email protected]> wrote:
>
> Woah, what an impressive list!
>
> For the validation part - why not somehow use JSON Schema[1]? I have used
> it in several projects and it plays nicely with CouchDB documents. It
> covers most common validation needs like requiring certain fields, enum
> support, pattern matching etc.
This is best for when we have this in JIRA, but just a quick note:

My thinking is to not have too many “languages” in CouchDB. I know I’m already bending this rule by suggesting JSON Pointers, but they could be snuck in as URL-JSON-paths without a huge learning curve.

Hence my thought of using Mango for validation, because it’s just one thing to learn with queries.

I’m happy to be convinced otherwise, too :)

Best
Jan
--

>
> Best,
> Johannes
>
> [1] http://json-schema.org/
>
> On 27.09.2016 at 2:57 pm, "Jan Lehnardt" <[email protected]> wrote:
>
>> Hi all,
>>
>> apologies in advance, this is going to be a long email.
>>
>>
>> I’ve been holding this back intentionally in order to be able to focus on
>> shipping 2.0, but now that that’s out, I feel we should talk about what’s
>> next.
>>
>> This email is separated into areas of work that I think CouchDB could
>> improve on, some with very concrete plans, some with rather vague ideas.
>> I’ve been collecting these over the past year or <strike>two</strike>five,
>> so it’s fairly wide, but I’m sure I’m missing things that other people find
>> important, so please add to this list.
>>
>> After the initial discussion here, I’ll move all of the individual issues
>> to JIRA, so we can go down our usual process.
>>
>> This is basically my wish list, and I’d like this to become everyone’s
>> wish list, so please add what I’ve been missing. :) Note, this isn’t a
>> free-for-all, only suggest things that you are prepared to see through
>> being shipped, from design, implementation to docs.
>>
>> I don’t have a specific order for these in mind, although I have a rough
>> idea of what we should be doing first. Putting all of this on a roadmap is
>> going to be a fun future exercise for us, though :)
>>
>> One last note: this doesn’t include anything on documentation or testing.
>> I fully expect to step up our game from here on out. This list is for the
>> technical aspects of the project.
>>
>> * * *
>>
>> These are the areas of work I’ve roughly come up with that my suggestions
>> fit into:
>>
>> - API
>> - Storage
>> - Query
>> - Replication
>> - Cluster
>> - Fauxton
>> - Releases
>> - Performance
>> - Internals
>> - Builds
>> - Features
>>
>> (I’m not claiming these are any good, but it’s what I’ve got)
>>
>>
>> Let’s go.
>>
>>
>> * * *
>>
>> # API
>>
>> ## HTTP2
>>
>> I think this is an obvious first next step. Our HTTP Layer needs work, our
>> existing HTTP server library is not getting HTTP2 support, it’s time to
>> attack this head-first. I’m imagining a Cowboy[1]-based HTTP layer that
>> calls into a unified internals layer and everything will be rose-golden.
>> HTTP2 support for Cowboy is still in progress. Maybe we can help them
>> along, or we focus on the internals refactor first and drop Cowboy in later
>> (not sure how feasible this approach is, but we’ll figure this out).
>>
>> In my head, we focus on this and call the result 3.0 in 6-12 months. That
>> doesn’t mean we *only* do this, but this will be the focus (more on this
>> later).
>>
>> There are a few fun considerations, mainly of the “avoid Python
>> 2/3-chasm”-type. Do we re-implement the 2.0 API with all its
>> idiosyncrasies, or do we take the opportunity to clean things up while we
>> are at it? If yes, how and how long do we support the then old API? Do we
>> manage this via different ports? If yes, how can this be made to work for
>> hosting services like Cloudant? Etc. etc.
>>
>> [1] https://github.com/ninenines/cowboy
>>
>>
>> ## Sub-Document Operations
>>
>> Currently a doc update needs the whole doc body sent to the server. There
>> are some obvious performance improvements possible. For the longest time, I
>> wanted to see if we can model sub-document operations via JSON Pointers[2].
>> These would roughly allow pointing to a JSON value via a URL.
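For readers who have not used JSON Pointers: a minimal Python sketch of RFC 6901 resolution, plus an in-place update, might look like the following. The function names (`resolve`, `assign`) are made up for illustration, and how a URL path such as `_jsonpointer/contact/address/zip` would map onto a pointer is only the proposal in the quoted mail, not an existing CouchDB endpoint.

```python
from typing import Any, List


def _tokens(pointer: str) -> List[str]:
    """Split an RFC 6901 pointer into unescaped reference tokens."""
    if pointer == "":
        return []  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    # RFC 6901 escaping: decode "~1" (slash) before "~0" (tilde).
    return [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]


def _index(token: str) -> int:
    """Turn a reference token into a list index (simplified: digits only)."""
    if not token.isdigit():
        raise KeyError(token)
    return int(token)


def _walk(doc: Any, tokens: List[str]) -> Any:
    """Follow the tokens through nested dicts and lists."""
    current = doc
    for token in tokens:
        if isinstance(current, list):
            current = current[_index(token)]
        elif isinstance(current, dict):
            current = current[token]
        else:
            raise KeyError(token)
    return current


def resolve(doc: Any, pointer: str) -> Any:
    """Return the value the pointer refers to."""
    return _walk(doc, _tokens(pointer))


def assign(doc: Any, pointer: str, value: Any) -> None:
    """Replace the value the pointer refers to, modifying doc in place."""
    tokens = _tokens(pointer)
    if not tokens:
        raise ValueError("refusing to replace the whole document")
    parent = _walk(doc, tokens[:-1])
    last = tokens[-1]
    if isinstance(parent, list):
        parent[_index(last)] = value
    elif isinstance(parent, dict):
        parent[last] = value
    else:
        raise KeyError(last)
```

With a helper like this, a sub-document update only needs the pointer, the new value and the revision to check against, rather than the whole document body.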
>>
>> For example in this doc:
>>
>>   {
>>     "_id": "123abc",
>>     "_rev": "zyx987",
>>     "contact": {
>>       "name": "",
>>       "address": {
>>         "street": "Long Street",
>>         "nr": 123,
>>         "zip": "12345"
>>       }
>>     }
>>   }
>>
>> An update to the zip code could look like this:
>>
>>   curl -X POST "$SERVER/db/123abc/_jsonpointer/contact/address/zip?rev=zyx987" -d '54321'
>>
>> GET/DELETE accordingly. We could shortcut the `_jsonpointer` to just `_`
>> if we like the short magic.
>>
>> JSONPointer can deal with nested objects and lists and works fairly well
>> for this type of stuff, and it is rather simple to implement (even I could
>> do it: https://github.com/janl/erl-jsonpointer/blob/master/src/jsonpointer.erl).
>> This idea is literally 5 years old, it looks like, so there is no need to
>> use my code if there is anything better.
>>
>> This is just a raw idea, and I’m happy to solve this any other way, if
>> somebody has a good approach.
>>
>> [2] https://tools.ietf.org/html/rfc6901
>>
>>
>> ## HTTP PATCH / JSON Diff
>>
>> Another stab at a similar problem is HTTP PATCH with JSON Diff, but with
>> the inherent problems of JSON normalisation, I’m leaning towards the
>> JSONPointer variant as simpler, but I’d be open for this as well, if
>> someone comes up with a good approach.
>>
>>
>> ## GraphQL[3]
>>
>> It’s rather new, but getting good traction[4]. This would be a nice
>> addition to our API. Somebody might already be hacking on this ;)
>>
>> [3]: http://graphql.org
>> [4]: http://githubengineering.com/the-github-graphql-api/
>>
>>
>> ## Mango for Document Validation
>>
>> The only place where we absolutely require writing JS is
>> validate_doc_update functions. Some security behaviour can only be enforced
>> there. With their inherent performance problems, I’d like to get doc
>> validations out of the path of the query server and would love to find a
>> way to validate document updates through Mango.
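To illustrate what validation through Mango could mean in practice, here is a rough Python sketch that checks an incoming document against a Mango-style selector and rejects it on mismatch. It is an assumption-laden toy: it handles only a small subset of selector operators (`$eq`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$exists`, `$regex`), treats all top-level selector fields as an implicit AND, uses Python's regex engine and comparison rules rather than CouchDB's, and the `validate_update` hook is hypothetical, not a CouchDB API.

```python
import re
from typing import Any, Mapping

_MISSING = object()  # sentinel for "field not present in the document"


def _lookup(doc: Mapping[str, Any], dotted: str) -> Any:
    """Fetch a possibly nested field written in dot notation, e.g. 'a.b.c'."""
    current: Any = doc
    for part in dotted.split("."):
        if not isinstance(current, Mapping) or part not in current:
            return _MISSING
        current = current[part]
    return current


def _compare(op: str, actual: Any, expected: Any) -> bool:
    """Evaluate one operator from the supported subset."""
    if op == "$exists":
        return (actual is not _MISSING) == expected
    if actual is _MISSING:
        return False
    try:
        if op == "$eq":
            return actual == expected
        if op == "$gt":
            return actual > expected
        if op == "$gte":
            return actual >= expected
        if op == "$lt":
            return actual < expected
        if op == "$lte":
            return actual <= expected
        if op == "$in":
            return actual in expected
        if op == "$regex":
            return isinstance(actual, str) and re.search(expected, actual) is not None
    except TypeError:
        return False  # values of incomparable types never match in this sketch
    raise ValueError(f"unsupported operator: {op}")


def matches(doc: Mapping[str, Any], selector: Mapping[str, Any]) -> bool:
    """True if every field condition in the selector holds (implicit AND)."""
    for field, condition in selector.items():
        actual = _lookup(doc, field)
        if isinstance(condition, Mapping):
            if not all(_compare(op, actual, exp) for op, exp in condition.items()):
                return False
        elif not _compare("$eq", actual, condition):
            return False
    return True


def validate_update(new_doc: Mapping[str, Any], selector: Mapping[str, Any]) -> None:
    """Hypothetical hook: raise if the new document fails the selector."""
    if not matches(new_doc, selector):
        raise ValueError("document rejected by validation selector")
```

A selector such as `{"type": "contact", "contact.address.zip": {"$regex": "^[0-9]{5}$"}}` would then play the role that a hand-written validate_doc_update function plays today, with the same query syntax users already know from Mango.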
>>
>>
>> ## Redesign Security System
>>
>> Our security system has grown slowly and was not coherently designed. We
>> should start over. I have many ideas and opinions, but they are out of
>> scope for this. I think everybody here agrees that we can do better. This
>> *very likely* will *not* include per-document ACLs as per the often stated
>> issues with that approach in our data model.
>>
>> * * *
>>
>>
>> # Replication
>>
>> This is our flagship feature of course, and there are a few things we can
>> do better.
>>
>>
>> ## Mobile-optimised extension or new version of the protocol
>>
>> The original protocol design didn’t take mobile devices into account and
>> through PouchDB et al. we are now learning that there are a number of
>> downsides to our protocol. We’ve helped a lot with introducing
>> _bulk_get/_revs, but that’s more a bandaid than a considered strategy ;)
>>
>> That new version could also be HTTP2-only, to take advantage of the new
>> connection semantics there.
>>
>>
>> ## Easy way to skip deletes on sync
>>
>> This one is self-explanatory, mobile clients usually don’t need to sync
>> deletes from a year ago first. Mango filters might already get us there,
>> maybe we can do better.
>>
>>
>> ## Sync a rolling subset
>>
>> Say you always want to keep the last 90 days of email on a mobile device
>> with optionally back-loading older documents on user-request. It is
>> something I could see getting a lot of traction.
>>
>> Today, this can be built on 1.x with clever use of _purge, but that’s
>> hardly a good experience. I don’t know if it can be done in a cluster.
>>
>>
>> ## Selective Sync
>>
>> There might be other criteria than “last 90 days”, so the more general
>> solution to this problem class would be arbitrary (e.g. client-directed)
>> selective sync, but this might be really hard as opposed to the merely
>> very hard “last 90 days” case, so happy to punt on this first. But filters
>> are generally not the answer, especially with large data sets.
>> Maybe proper sync from views _changes is the answer.
>>
>>
>> ## A _db_updates powered _replicator DB
>>
>> Running thousands+ of replications on a server is not really resource
>> friendly today, we should teach the replicator to only run replication on
>> active databases via _db_updates. Somebody might already be looking into
>> this one.
>>
>> * * *
>>
>>
>> # Storage
>>
>>
>> ## Pluggable Storage Engines
>>
>> Paul Davis already showed some work on allowing multiple different storage
>> backends. I’d like to see this land.
>>
>> ## Different Storage Backends
>>
>> These don’t all have to be supported by the main project, but I’d really
>> like to see some experimentation with different backends like
>> LevelDB[5]/RocksDB[6], InnoDB[7], SQLite[8], or a native-Erlang one that is
>> optimised for space usage and not performance (I don’t want to budge on
>> safety). Similarly, it’d be fun to see if there is a compression format
>> that we can use as a storage backend directly, so we get full-DB
>> compression as opposed to just per-doc compression.
>>
>> [5]: http://leveldb.org
>> [6]: http://rocksdb.org
>> [7]: https://en.wikipedia.org/wiki/InnoDB
>> [8]: https://www.sqlite.org
>>
>> * * *
>>
>>
>> # Query
>>
>> ## Teach Mango JOINs and result sorting
>>
>> It’s the natural path for query languages. We should make these happen.
>> Once we have the basics, we might even be able to find a way to compile
>> basic SQL into Mango, it’s going to be glorious :)
>>
>>
>> ## “No-JavaScript”-mode
>>
>> I’ve hinted at this above, but I’d really like a way for users to use
>> CouchDB productively without having to write a line of JavaScript. My main
>> motivation is the poor performance characteristics of the Query Server
>> (hello CGI[9]?). But even with one that is improved, it will always be
>> faster to do any, say, filtering or validation operations in native Erlang.
>> I don’t know if we can expand Mango to cover all this, and I’m not really
>> concerned about the specifics, as long as we get there.
>>
>> Of course, for pro-users, the JS-variant will still be around.
>>
>> [9]: https://en.wikipedia.org/wiki/Common_Gateway_Interface
>>
>>
>> ## Query Server V2
>>
>> We need to revamp the Query Server. It is hardcoded to an out-of-date
>> version of SpiderMonkey and we are stuck with C-bindings that barely anyone
>> dares to look at, let alone iterate on.
>>
>> I believe the way forward is revamping the query server protocol to use
>> streaming IO instead of blocking batches like we do now, and to use a
>> JS-native implementation of the JS-side instead of C-bindings.
>>
>> I’m partial to doing this straight in Node, because there is a ton of
>> support for things we need already, and I believe we’ve solved the
>> isolation issues required for secure MapReduce, but I’m happy to use any
>> other thing as well, if it helps.
>>
>> Other benefits would be support for emerging JS features that devs will
>> want to use.
>>
>> And we can have two modes: standalone QS like now, and embedded QS where,
>> say, V8 is compiled into the Erlang VM. Not everybody will want to run
>> this, but it’ll be neat for those who do.
>>
>>
>> * * *
>>
>>
>> # Cluster
>>
>> ## Rebalancing
>>
>> With this we will be able to grow clusters one by one instead of hitting a
>> wall when eventually each shard lives on a single machine. E.g. when you
>> add a node to the cluster, all other nodes share 1/Nth of their data with
>> the new node, and everything can keep going. Same for removing a node and
>> shrinking the cluster.
>>
>> Couchbase has this and it is really nice.
>>
>>
>> ## Setup
>>
>> Even without rebalancing, we need a nice Fauxton UI to manage the cluster,
>> so far we only have a simple setup procedure (which is great, don’t get me
>> wrong), but users will want to do more elaborate cluster management and we
>> should make that easy with a slick UI.
>>
>>
>> ## Cluster-Aware Clients
>>
>> This might end up being not a good idea, but I’d like some experimentation
>> here. Say you’d have a CouchDB client that could be hooked into the cluster
>> topology so it’d know which nodes to query for which data, then we can save
>> a proxy-hop, and build clients that have lower-latency access to CouchDB.
>> Again, this is something that Couchbase does and I think is worth exploring.
>>
>>
>>
>> * * *
>>
>>
>> # Fauxton
>>
>> Fauxton is great, but it could be better too, I think. I’m mostly
>> concerned about the number of clicks/taps required for more specialised
>> actions (like setting the group_level of a reduce query, it’s like 15 or
>> so). More cluster info would also be nice, and maybe a specialised
>> dashboard for db-per-user setups.
>>
>>
>> * * *
>>
>>
>> # Releases
>>
>>
>> ## Six-Week Release Trains
>>
>> We need to get back to frequent releases and I propose to go back to our
>> six-week-release train plans from three years ago. Whatever lands within a
>> release train time frame goes out. The nature of the change dictates the
>> version number increment as per semver, and we just ship a new version
>> every six weeks, even if it only includes a single bug fix. We should
>> automate most of this infrastructure, so actual releases are cheap. We are
>> reasonably close with this, but we need some more folks to step up on using
>> and maintaining our CI systems.
>>
>>
>> ## One major feature per major version
>>
>> I also propose to keep the scope of future major versions small, so we
>> don’t have to wait another 3-5 years for 3.0. In particular, I think we
>> should focus on a single major feature per major version and get that
>> shipped within 6-12 months tops. If anything needs more time, it needs to
>> be broken up. Of course we continue to add features and fix things while
>> this happens, but as a project, there is *one* major feature we push. For
>> example, for 3.0 I see our push being behind HTTP2 support.
>> There is a lot of subsequent work required to make that happen, so it’ll
>> be a worthwhile 3.0, but we can ship it in 6-12 months (hopefully).
>>
>> Best case scenario, we have CouchDB 4.0 coming out 12 months from now with
>> two new major features. That would be amazing.
>>
>>
>> * * *
>>
>>
>> # Performance
>>
>> ## Perf Team
>>
>> We need a team to comprehensively look at CouchDB performance. There is a
>> lot of low-hanging fruit like Robert Kowalski showed a while back, we
>> should get back into this. I’m mostly inspired by SQLite who’ve done a
>> release a while back that only focussed on 1-2% performance improvements,
>> but got like 20-30 of those and made the thing a lot faster across the
>> board. I can’t remember where I read about this, but I’ll update this once
>> I find the link.
>>
>>
>> ## Benchmark Suite
>>
>> We need a benchmark suite that tests a variety of different work loads.
>> The goal here is to run different versions of CouchDB against the same
>> suite on the same hardware, to see where we are going. I’m imagining a
>> http://arewefastyet.com style dashboard where we can track this, and even
>> run this on Pull Requests and not allow them if they significantly impact
>> performance.
>>
>>
>> ## Synthetic Load Suite
>>
>> This one is for end users. I’d like to be able to say: My app produces
>> mostly 10-20kb-sized docs, but millions of those in a single database, or
>> across 1000s of databases, with these views etc. and then run this on
>> target hardware so I’d know, e.g. how many nodes I need for a cluster with
>> my estimated workload. I know this can only be done in approximation, but I
>> think this could make a big difference in CouchDB adoption and feed back
>> into Perf Team mentioned above.
>>
>> * * *
>>
>>
>> # Internals
>>
>> ## Consolidate Repositories
>>
>> With 2.0 we started to experiment with radically small modules for our
>> components and I think we’ve come to the conclusion that some consolidation
>> is better for us going forward. Obvious candidates for separate repos are
>> docs, Fauxton etc. but also some of the Erlang modules that other projects
>> reasonably would use.
>>
>>
>> ## Elixir
>>
>> I’d like it very much if we elevate Elixir as a prime target language for
>> writing CouchDB internals. I believe this would get us an influx of new
>> developers that we badly need to get all the things I’m listing here done.
>> Somebody might be looking into the technical aspects of this already, but
>> we need to decide as a project if we are okay with that.
>>
>>
>> ## GitHub Issues
>>
>> I hope we can transition to GitHub Issues soon.
>>
>> * * *
>>
>>
>> # Builds
>>
>> I’d like automated builds for source, Docker et al., rpm, deb, brew,
>> ports, Mac Binary, etc. with proper release channels for people to
>> subscribe to, all powered by CI for nightly builds, so people can test
>> in-development versions easily.
>>
>> I’d also like builds that include popular community plugins like Geo or
>> Fulltext Search.
>>
>>
>>
>> * * *
>>
>>
>> # Features
>>
>> ## Better Support for db-per-user
>>
>> I don’t know what this will look like, but this is a pattern, and we need
>> to support it better.
>>
>> One approach could be “virtual dbs” that are backed by a single database,
>> but that’s usually at odds with views, so we could make this an XOR and
>> disable views on these dbs. Since this usually powers client-heavy apps,
>> querying usually happens there anyway.
>>
>> Another approach would be better / easier cross-db aggregation or
>> querying. There are a few approaches, but nothing really slick.
>>
>>
>> ## Schema Extraction
>>
>> I have half an (old) patch that extracts top level fields from a document
>> and stores them with a hash in an “attachment” to the database header. So
>> we only end up storing doc values and the schema hash. First of all this
>> trades storage for CPU time (I haven’t measured anything yet), but more
>> interestingly, we could use that schema data to do smart things like
>> auto-generating a validation function / mango expression based on the data
>> that is already in the database. And other fun things like easier schema
>> migration operations that are native in CouchDB and thus a lot faster than
>> external ones. For the curious ones, I’ve got the idea from V8’s property
>> access optimisation strategy[10].
>>
>> [10]: https://github.com/v8/v8/wiki/Design%20Elements#fast-property-access
>>
>> * * *
>>
>> Alright, that’s it for now. Can’t wait for your feedback!
>>
>> Best
>> Jan
>> --
>> Professional Support for Apache CouchDB:
>> https://neighbourhood.ie/couchdb-support/
>>
>>

--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
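As a footnote to the Schema Extraction item in the quoted mail: the patch itself is not shown, so the following Python sketch is only a guess at the general idea, namely storing each distinct set of top-level field names once under a hash and keeping just the values plus that hash per document. The `SchemaStore` class, its in-memory dictionaries and the choice of SHA-256 are illustrative assumptions, not CouchDB internals.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def schema_of(doc: Dict[str, Any]) -> Tuple[str, ...]:
    """The 'schema' here is just the sorted tuple of top-level field names."""
    return tuple(sorted(doc))


def schema_hash(fields: Tuple[str, ...]) -> str:
    """Stable hash of a field list (SHA-256 is an arbitrary choice)."""
    return hashlib.sha256(json.dumps(list(fields)).encode("utf-8")).hexdigest()


class SchemaStore:
    """Toy store that keeps field names once per schema, values per document."""

    def __init__(self) -> None:
        self.schemas: Dict[str, Tuple[str, ...]] = {}    # hash -> field names
        self.rows: Dict[str, Tuple[str, List[Any]]] = {}  # doc id -> (hash, values)

    def put(self, doc: Dict[str, Any]) -> str:
        fields = schema_of(doc)
        digest = schema_hash(fields)
        self.schemas.setdefault(digest, fields)
        # Only the values and the schema hash are stored for the document.
        self.rows[doc["_id"]] = (digest, [doc[f] for f in fields])
        return digest

    def get(self, doc_id: str) -> Dict[str, Any]:
        digest, values = self.rows[doc_id]
        return dict(zip(self.schemas[digest], values))
```

The `schemas` table is also the kind of data a generated validation rule or a migration tool could start from, since it lists every document shape that currently exists in the database.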
