Re: CouchDB Next

Johs Ensby Tue, 27 Sep 2016 20:44:11 -0700
Congratulations on the 2.0 release, Jan & dev team!

I will try to start a discussion on marketing strategy for 3.0 on the marketing 
list.
CouchDB is clearly a developer-centric project and has to be (individual 
motivation is the driving force), but I think a marketing-oriented perspective 
could be useful, especially when setting goals for the next major release.
br
Johs
 
> On 27. sep. 2016, at 14.56, Jan Lehnardt <[email protected]> wrote:
> 
> Hi all,
> 
> apologies in advance, this is going to be a long email.
> 
> 
> I’ve been holding this back intentionally in order to be able to focus on 
> shipping 2.0, but now that that’s out, I feel we should talk about what’s 
> next.
> 
> This email is separated into areas of work that I think CouchDB could improve 
> on, some with very concrete plans, some with rather vague ideas. I’ve been 
> collecting these over the past year or <strike>two</strike>five, so it’s 
> fairly wide, but I’m sure I’m missing things that other people find 
> important, so please add to this list.
> 
> After the initial discussion here, I’ll move all of the individual issues to 
> JIRA, so we can go down our usual process.
> 
> This is basically my wish list, and I’d like this to become everyone’s wish 
> list, so please add what I’ve been missing. :) — Note, this isn’t a 
> free-for-all, only suggest things that you are prepared to see through being 
> shipped, from design, implementation to docs.
> 
> I don’t have a specific order for these in mind, although I have a rough idea 
> of what we should be doing first. Putting all of this on a roadmap is going 
> to be a fun future exercise for us, though :)
> 
> One last note: this doesn’t include anything on documentation or testing. I 
> fully expect us to step up our game from here on out. This list is for the 
> technical aspects of the project.
> 
> * * *
> 
> These are the areas of work I’ve roughly come up with that my suggestions fit 
> into:
> 
> - API
> - Storage
> - Query
> - Replication
> - Cluster
> - Fauxton
> - Releases
> - Performance
> - Internals
> - Builds
> - Features
> 
> (I’m not claiming these are any good, but it’s what I’ve got)
> 
> 
> Let’s go.
> 
> 
> * * *
> 
> # API
> 
> ## HTTP2
> 
> I think this is an obvious first next step. Our HTTP Layer needs work, our 
> existing HTTP server library is not getting HTTP2 support, it’s time to 
> attack this head-first. I’m imagining a Cowboy[1]-based HTTP layer that calls 
> into a unified internals layer and everything will be rose-golden. HTTP2 
> support for Cowboy is still in progress. Maybe we can help them along, or we 
> focus on the internals refactor first and drop Cowboy in later (not sure how 
> feasible this approach is, but we’ll figure this out).
> 
> In my head, we focus on this and call the result 3.0 in 6-12 months. That 
> doesn’t mean we *only* do this, but this will be the focus (more on this 
> later).
> 
> There are a few fun considerations, mainly of the “avoid Python 
> 2/3-chasm”-type. Do we re-implement the 2.0 API with all its idiosyncrasies, 
> or do we take the opportunity to clean things up while we are at it? If yes, 
> how and how long do we support the then old API? Do we manage this via 
> different ports? If yes, how can this be made to work for hosting services 
> like Cloudant? Etc. etc.
> 
> [1] https://github.com/ninenines/cowboy
> 
> 
> ## Sub-Document Operations
> 
> Currently a doc update needs the whole doc body sent to the server. There are 
> some obvious performance improvements possible. For the longest time, I 
> wanted to see if we can model sub-document operations via JSON Pointers[2]. 
> These would roughly allow pointing to a JSON value via a URL.
> 
> For example in this doc:
> 
> {
>  "_id": "123abc",
>  "_rev": "zyx987",
>  "contact": {
>    "name": "",
>    "address": {
>      "street": "Long Street",
>      "nr": 123,
>      "zip": "12345"
>    }
>  }
> }
> 
> An update to the zip code could look like this:
> 
> curl -X POST "$SERVER/db/123abc/_jsonpointer/contact/address/zip?rev=zyx987" \
>   -d '54321'
> 
> GET/DELETE accordingly. We could shortcut the `_jsonpointer` to just `_` if 
> we like the short magic.
> 
> JSONPointer can deal with nested objects and lists and works fairly well for 
> this type of stuff, and it is rather simple to implement (even I could do it: 
> https://github.com/janl/erl-jsonpointer/blob/master/src/jsonpointer.erl — 
> This idea is literally 5 years old, it looks like, no need to use my code if 
> there is anything better).
> 
> This is just a raw idea, and I’m happy to solve this any other way, if 
> somebody has a good approach.
> 
> [2] https://tools.ietf.org/html/rfc6901
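
To make the JSON Pointer idea concrete, here is a minimal Python sketch of how a pointer such as /contact/address/zip could be resolved and updated on the example document. The function names (get_pointer, set_pointer) are hypothetical, this is not the linked erl-jsonpointer code, and it only illustrates RFC 6901 token handling, not any CouchDB endpoint.

```python
import copy


def _tokens(pointer):
    """Split an RFC 6901 pointer into unescaped reference tokens."""
    if pointer == "":
        return []
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    # RFC 6901: decode '~1' to '/' first, then '~0' to '~'.
    return [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]


def _step(node, token):
    """Follow one token into a list (by index) or an object (by key)."""
    if isinstance(node, list):
        return node[int(token)]
    return node[token]


def get_pointer(doc, pointer):
    """Return the value that the pointer refers to."""
    node = doc
    for token in _tokens(pointer):
        node = _step(node, token)
    return node


def set_pointer(doc, pointer, value):
    """Return a copy of doc with the value at pointer replaced."""
    tokens = _tokens(pointer)
    if not tokens:
        return value  # the empty pointer refers to the whole document
    new_doc = copy.deepcopy(doc)
    parent = new_doc
    for token in tokens[:-1]:
        parent = _step(parent, token)
    if isinstance(parent, list):
        parent[int(tokens[-1])] = value
    else:
        parent[tokens[-1]] = value
    return new_doc


doc = {
    "_id": "123abc",
    "_rev": "zyx987",
    "contact": {
        "name": "",
        "address": {"street": "Long Street", "nr": 123, "zip": "12345"},
    },
}
updated = set_pointer(doc, "/contact/address/zip", "54321")
```

A server-side version would presumably still check the rev and write a full new revision; only the request body gets smaller.
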
> 
> 
> ## HTTP PATCH / JSON Diff
> 
> Another stab at a similar problem is HTTP PATCH with JSON Diff, but with the 
> inherent problems of JSON normalisation, I’m leaning towards the JSONPointer 
> variant as simpler, but I’d be open to this as well, if someone comes up 
> with a good approach.
> 
> 
> ## GraphQL[3]
> 
> It’s rather new, but getting good traction[4]. This would be a nice addition 
> to our API. Somebody might already be hacking on this ;)
> 
> [3]: http://graphql.org
> [4]: http://githubengineering.com/the-github-graphql-api/
> 
> 
> ## Mango for Document Validation
> 
> The only place where we absolutely require writing JS is validate_doc_update 
> functions. Some security behaviour can only be enforced there. With their 
> inherent performance problems, I’d like to get doc validations out of the 
> path of the query server and would love to find a way to validate document 
> updates through Mango.
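
As a rough illustration of what validating through a selector could look like, here is a small Python sketch that checks a document against a Mango-style selector and rejects it if it does not match. It supports only a handful of operators, the names matches and validate_update are hypothetical, and it is not how Mango is implemented in CouchDB.

```python
_MISSING = object()  # sentinel for "field not present in the document"

# A deliberately small subset of Mango-style condition operators.
OPERATORS = {
    "$eq": lambda value, arg: value == arg,
    "$ne": lambda value, arg: value != arg,
    "$gte": lambda value, arg: value is not _MISSING and value >= arg,
    "$lte": lambda value, arg: value is not _MISSING and value <= arg,
    "$in": lambda value, arg: value in arg,
    "$exists": lambda value, arg: (value is not _MISSING) == arg,
}


def matches(selector, doc):
    """Return True if doc satisfies every condition in selector."""
    for field, condition in selector.items():
        value = doc.get(field, _MISSING) if isinstance(doc, dict) else _MISSING
        if isinstance(condition, dict) and condition and all(
            key.startswith("$") for key in condition
        ):
            # {"field": {"$op": arg, ...}}: every operator must hold.
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif isinstance(condition, dict):
            # {"field": {"sub": ...}}: recurse into the nested object.
            if not matches(condition, value):
                return False
        elif value != condition:
            # {"field": literal} is an implicit equality test.
            return False
    return True


def validate_update(new_doc, selector):
    """Reject the update unless the new document matches the selector."""
    if not matches(selector, new_doc):
        raise ValueError("forbidden: document does not match validation selector")


SELECTOR = {
    "type": "contact",
    "age": {"$gte": 18},
    "contact": {"name": {"$exists": True}},
}
```

Note that validate_doc_update functions also see the old document and the user context, which a plain selector over the new document does not cover.
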
> 
> 
> ## Redesign Security System
> 
> Our security system has grown slowly and was not coherently designed. We should 
> start over. I have many ideas and opinions, but they are out of scope for 
> this. I think everybody here agrees that we can do better. This *very likely* 
> will *not* include per-document ACLs as per the often stated issues with that 
> approach in our data model.
> 
> * * *
> 
> 
> # Replication
> 
> This is our flagship feature of course, and there are a few things we can do 
> better.
> 
> 
> ## Mobile-optimised extension or new version of the protocol
> 
> The original protocol design didn’t take mobile devices into account and 
> through PouchDB et al. we are now learning that there are a number of downsides 
> to our protocol. We’ve helped a lot with introducing _bulk_get/_revs, but 
> that’s more a bandaid than a considered strategy ;)
> 
> That new version could also be HTTP2-only, to take advantage of the new 
> connection semantics there.
> 
> 
> ## Easy way to skip deletes on sync
> 
> This one is self-explanatory, mobile clients usually don’t need to sync 
> deletes from a year ago first. Mango filters might already get us there, 
> maybe we can do better.
> 
> 
> ## Sync a rolling subset
> 
> Say you always want to keep the last 90 days of email on a mobile device with 
> optionally back-loading older documents on user-request. It is something I 
> could see getting a lot of traction.
> 
> Today, this can be built on 1.x with clever use of _purge, but that’s hardly 
> a good experience. I don’t know if it can be done in a cluster.
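
A sketch of the rolling-window decision itself, in Python. It assumes every document carries an ISO 8601 timestamp in a field called created_at, which is an assumption of this example and not something CouchDB provides. It only decides which document IDs belong on the device and says nothing about how the protocol would transfer them.

```python
from datetime import datetime, timedelta, timezone


def split_rolling_window(docs, now, days=90, field="created_at"):
    """Split doc IDs into (keep on device, back-load on request)."""
    cutoff = now - timedelta(days=days)
    recent, older = [], []
    for doc in docs:
        stamp = datetime.fromisoformat(doc[field])
        (recent if stamp >= cutoff else older).append(doc["_id"])
    return recent, older


now = datetime(2016, 9, 27, tzinfo=timezone.utc)
docs = [
    {"_id": "mail-1", "created_at": "2016-09-01T08:00:00+00:00"},
    {"_id": "mail-2", "created_at": "2016-01-15T08:00:00+00:00"},
]
recent, older = split_rolling_window(docs, now)
```
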
> 
> 
> ## Selective Sync
> 
> There might be other criteria than “last 90 days”, so the more general 
> solution to this problem class would be arbitrary (e.g. client-directed) 
> selective sync, but this might be really hard as opposed to the merely very 
> hard “last 90 days” one, so happy to punt on this first. But filters are 
> generally not the answer, especially with large data sets. Maybe proper sync 
> from views _changes is the answer.
> 
> 
> ## A _db_updates powered _replicator DB
> 
> Running thousands+ of replications on a server is not really resource 
> friendly today, we should teach the replicator to only run replication on 
> active databases via _db_updates. Somebody might already be looking into this 
> one.
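
One way to picture this: the replicator keeps a set of recently active databases fed by _db_updates events and only starts the jobs whose source is in that set. The Python sketch below assumes events shaped like _db_updates rows (db_name and type) and replication documents whose source is a plain local database name. The function name is hypothetical.

```python
def replications_to_run(events, replications):
    """Return the IDs of replication docs whose source database is active."""
    active = set()
    for event in events:
        if event["type"] == "deleted":
            # A deleted database has nothing left to replicate.
            active.discard(event["db_name"])
        else:
            # "created" and "updated" both mean there may be new changes.
            active.add(event["db_name"])
    return [rep["_id"] for rep in replications if rep["source"] in active]


events = [
    {"db_name": "userdb-a", "type": "updated"},
    {"db_name": "userdb-b", "type": "created"},
    {"db_name": "userdb-b", "type": "deleted"},
]
replications = [
    {"_id": "rep-a", "source": "userdb-a", "target": "backup-a"},
    {"_id": "rep-b", "source": "userdb-b", "target": "backup-b"},
    {"_id": "rep-c", "source": "userdb-c", "target": "backup-c"},
]
```
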
> 
> * * *
> 
> 
> # Storage
> 
> 
> ## Pluggable Storage Engines
> 
> Paul Davis already showed some work on allowing multiple different storage 
> backends. I’d like to see this land.
> 
> ## Different Storage Backends
> 
> These don’t all have to be supported by the main project, but I’d really like 
> to see some experimentation with different backends like 
> LevelDB[5]/RocksDB[6], InnoDB[7], SQLite[8], or a native-Erlang one that is 
> optimised for space usage and not performance (I don’t want to budge on 
> safety). Similarly, it’d be fun to see if there is a compression format that 
> we can use as a storage backend directly, so we get full-DB compression as 
> opposed to just per-doc compression.
> 
> [5]: http://leveldb.org
> [6]: http://rocksdb.org
> [7]: https://en.wikipedia.org/wiki/InnoDB
> [8]: https://www.sqlite.org
> 
> * * *
> 
> 
> # Query
> 
> ## Teach Mango JOINs and result sorting
> 
> It’s the natural path for query languages. We should make these happen. Once 
> we have the basics, we might even be able to find a way to compile basic SQL 
> into Mango, it’s going to be glorious :)
> 
> 
> ## “No-JavaScript”-mode
> 
> I’ve hinted at this above, but I’d really like a way for users to use CouchDB 
> productively without having to write a line of JavaScript. My main motivation 
> is the poor performance characteristics of the Query Server (hello CGI[9]?). 
> But even with one that is improved, it will always be faster to do, say, 
> filtering or validation operations in native Erlang. I don’t know if we can 
> expand Mango to cover all this, and I’m not really concerned about the 
> specifics, as long as we get there.
> 
> Of course, for pro-users, the JS-variant will still be around.
> 
> [9]: https://en.wikipedia.org/wiki/Common_Gateway_Interface
> 
> 
> ## Query Server V2
> 
> We need to revamp the Query Server. It is hardcoded to an out-of-date version 
> of SpiderMonkey and we are stuck with C-bindings that barely anyone dares to 
> look at, let alone iterate on. 
> 
> I believe the way forward is revamping the query server protocol to use 
> streaming IO instead of blocking batches like we do now, and to use a 
> JS-native implementation of the JS side instead of C-bindings.
> 
> I’m partial to doing this straight in Node, because there is a ton of support 
> for things we need already, and I believe we’ve solved the isolation issues 
> required for secure MapReduce, but I’m happy to use any other thing as well, 
> if it helps.
> 
> Other benefits would be support for emerging JS features that devs will want 
> to use.
> 
> And we can have two modes: standalone QS like now, and embedded QS where, 
> say, V8 is compiled into the Erlang VM. Not everybody will want to run this, 
> but it’ll be neat for those who do.
> 
> 
> * * *
> 
> 
> # Cluster
> 
> ## Rebalancing
> 
> With this we will be able to grow clusters one by one instead of hitting a 
> wall when eventually each shard lives on a single machine. E.g. when you add 
> a node to the cluster, all other nodes share 1/Nth of their data with the new 
> node, and everything can keep going. Same for removing a node and shrinking 
> the cluster.
> 
> Couchbase has this and it is really nice.
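
A toy Python sketch of the "everyone hands over a share" idea: when a node joins, shards move from the most loaded nodes to the new one until it holds its fair share. It ignores replicas, shard sizes, and the data transfer itself. The shard and node names are made up, and this is not how the cluster layer assigns shards.

```python
from collections import Counter


def add_node(assignment, new_node):
    """Move shards to new_node until it holds a fair share.

    assignment maps shard name -> node name. Returns the new
    assignment and the list of shards that moved.
    """
    nodes = set(assignment.values()) | {new_node}
    target = len(assignment) // len(nodes)  # fair share for the new node
    result = dict(assignment)
    load = Counter(result.values())
    moved = []
    while load[new_node] < target:
        # Take one shard from the currently most loaded existing node.
        donors = sorted(n for n in nodes if n != new_node)
        donor = max(donors, key=lambda n: load[n])
        shard = min(s for s, n in result.items() if n == donor)
        result[shard] = new_node
        load[donor] -= 1
        load[new_node] += 1
        moved.append(shard)
    return result, moved


before = {f"s{i}": ("n1" if i < 4 else "n2") for i in range(8)}
after, moved = add_node(before, "n3")
```

Only the moved shards change owner, which is the property that lets the cluster keep serving requests while it grows.
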
> 
> 
> ## Setup
> 
> Even without rebalancing, we need a nice Fauxton UI to manage the cluster, so 
> far we only have a simple setup procedure (which is great, don’t get me 
> wrong), but users will want to do more elaborate cluster management and we 
> should make that easy with a slick UI.
> 
> 
> ## Cluster-Aware Clients
> 
> This might end up being not a good idea, but I’d like some experimentation 
> here. Say you’d have a CouchDB client that could be hooked into the cluster 
> topology so it’d know which nodes to query for which data, then we can save a 
> proxy-hop, and build clients that have lower-latency access to CouchDB. 
> Again, this is something that Couchbase does and I think is worth exploring.
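
To sketch what such a client would compute: hash the document ID, find the shard range the hash falls into, and talk to a node that holds that range. The Python below assumes CRC32 hashing over a 32-bit space split into q equal ranges. That is my understanding of how 2.0 places documents, but treat it, the shard_map layout, and the function names as assumptions of this example.

```python
import zlib


def shard_index(doc_id, q):
    """Map a document ID to one of q equal ranges of the 32-bit hash space."""
    size = 2**32 // q
    # min() keeps hashes in the leftover tail (when q does not divide 2**32)
    # inside the last range.
    return min(zlib.crc32(doc_id.encode("utf-8")) // size, q - 1)


def nodes_for(doc_id, shard_map):
    """Return the nodes holding the shard that doc_id hashes to.

    shard_map is a list with one entry per shard range, each entry
    being the list of node names that hold a copy of that range.
    """
    return shard_map[shard_index(doc_id, len(shard_map))]


shard_map = [
    ["node1", "node2"],
    ["node2", "node3"],
    ["node3", "node1"],
    ["node1", "node3"],
]
```

The client would have to refresh shard_map whenever the topology changes, which is where most of the complexity would probably sit.
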
> 
> 
> 
> * * *
> 
> 
> # Fauxton
> 
> Fauxton is great, but it could be better too, I think. I’m mostly concerned 
> about the number of clicks/taps required for more specialised actions (like 
> setting the group_level of a reduce query, it’s like 15 or so). More cluster 
> info would also be nice, and maybe a specialised dashboard for db-per-user 
> setups.
> 
> 
> * * *
> 
> 
> # Releases
> 
> 
> ## Six-Week Release Trains
> 
> We need to get back to frequent releases and I propose to go back to our 
> six-week-release train plans from three years ago. Whatever lands within a 
> release train time frame goes out. The nature of the change dictates the 
> version number increment as per semver, and we just ship a new version every 
> six weeks, even if it only includes a single bug fix. We should automate most 
> of this infrastructure, so actual releases are cheap. We are reasonably close 
> with this, but we need some more folks to step up on using and maintaining 
> our CI systems.
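
The "nature of the change dictates the version number" rule is easy to automate. A minimal Python sketch, assuming each merged change is labelled as "breaking", "feature", or "fix" (the labels and the function name are hypothetical):

```python
def next_version(current, changes):
    """Return the semver version for this train, given what landed."""
    major, minor, patch = (int(part) for part in current.split("."))
    if "breaking" in changes:
        return f"{major + 1}.0.0"
    if "feature" in changes:
        return f"{major}.{minor + 1}.0"
    if "fix" in changes:
        return f"{major}.{minor}.{patch + 1}"
    return current  # nothing landed, so there is nothing to release
```

For example, a train that carries a single bug fix on top of 2.0.0 ships as 2.0.1, while one that also carries a new feature ships as 2.1.0.
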
> 
> 
> ## One major feature per major version
> 
> I also propose to keep the scope of future major versions small, so we don’t 
> have to wait another 3-5 years for 3.0. In particular, I think we should 
> focus on a single major feature per major version and get that shipped within 
> 6-12 months tops. If anything needs more time, it needs to be broken up. Of 
> course we continue to add features and fix things while this happens, but as 
> a project, there is *one* major feature we push. For example, for 3.0 I see 
> our push being behind HTTP2 support. There is a lot of subsequent work required 
> to make that happen, so it’ll be a worthwhile 3.0, but we can ship it in 6-12 
> months (hopefully).
> 
> Best case scenario, we have CouchDB 4.0 coming out 12 months from now with 
> two new major features. That would be amazing.
> 
> 
> * * *
> 
> 
> # Performance
> 
> ## Perf Team
> 
> We need a team to take a comprehensive look at CouchDB performance. There is a lot 
> of low-hanging fruit like Robert Kowalski showed a while back, we should get 
> back into this. I’m mostly inspired by SQLite who’ve done a release a while 
> back that only focussed on 1-2% performance improvements, but got like 20-30 
> of those and made the thing a lot faster across the board. I can’t remember 
> where I read about this, but I’ll update this once I find the link.
> 
> 
> ## Benchmark Suite
> 
> We need a benchmark suite that tests a variety of different work loads. The 
> goal here is to run different versions of CouchDB against the same suite on 
> the same hardware, to see where we are going. I’m imagining a 
> http://arewefastyet.com style dashboard where we can track this, and even run 
> this on Pull Requests and not allow them if they significantly impact 
> performance.
> 
> 
> ## Synthetic Load Suite
> 
> This one is for end users. I’d like to be able to say: My app produces mostly 
> 10-20kb-sized docs, but millions of those in a single database, or across 
> 1000s of databases, with these views etc. and then run this on target 
> hardware so I’d know, e.g. how many nodes I need for a cluster with my 
> estimated workload. I know this can only be done in approximation, but I 
> think this could make a big difference in CouchDB adoption and feed back into 
> the Perf Team mentioned above.
> 
> * * *
> 
> 
> # Internals
> 
> ## Consolidate Repositories
> 
> With 2.0 we started to experiment with radically small modules for our 
> components and I think we’ve come to the conclusion that some consolidation 
> is better for us going forward. Obvious candidates for separate repos are 
> docs, Fauxton etc. but also some of the Erlang modules that other projects 
> reasonably would use.
> 
> 
> ## Elixir
> 
> I’d like it very much if we elevate Elixir as a prime target language for 
> writing CouchDB internals. I believe this would get us an influx of new 
> developers that we badly need to get all the things I’m listing here done. 
> Somebody might be looking into the technical aspects of this already, but we 
> need to decide as a project if we are okay with that.
> 
> 
> ## GitHub Issues
> 
> I hope we can transition to GitHub Issues soon.
> 
> * * *
> 
> 
> # Builds
> 
> I’d like automated builds for source, Docker et al., rpm, deb, brew, ports, 
> Mac Binary, etc. with proper release channels for people to subscribe to, all 
> powered by CI for nightly builds, so people can test in-development versions 
> easily.
> 
> I’d also like builds that include popular community plugins like Geo or 
> Fulltext Search.
> 
> 
> 
> * * *
> 
> 
> # Features
> 
> ## Better Support for db-per-user
> 
> I don’t know what this will look like, but this is a pattern, and we need to 
> support it better.
> 
> One approach could be “virtual dbs” that are backed by a single database, but 
> that’s usually at odds with views, so we could make this an XOR and disable 
> views on these dbs. Since this usually powers client-heavy apps, querying 
> usually happens there anyway.
> 
> Another approach would be better / easier cross-db aggregation or querying. 
> There are a few approaches, but nothing really slick.
> 
> 
> ## Schema Extraction
> 
> I have half an (old) patch that extracts top level fields from a document and 
> stores them with a hash in an “attachment” to the database header. So we only 
> end up storing doc values and the schema hash. First of all this trades 
> storage for CPU time (I haven’t measured anything yet), but more 
> interestingly, we could use that schema data to do smart things like 
> auto-generating a validation function / mango expression based on the data 
> that is already in the database. And other fun things like easier schema 
> migration operations that are native in CouchDB and thus a lot faster than 
> external ones. For the curious ones, I’ve got the idea from V8’s property 
> access optimisation strategy[10].
> 
> [10]: https://github.com/v8/v8/wiki/Design%20Elements#fast-property-access
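
A small Python sketch of the storage trick described above: the sorted top-level field names are hashed once and kept in a shared table, and each document is stored as the schema hash plus its values. The in-memory dict stands in for the "attachment" to the database header, and all names here are hypothetical. This is not the patch mentioned in the email.

```python
import hashlib
import json


def extract_schema(doc, schemas):
    """Store doc as (schema hash, values); record the schema once."""
    fields = sorted(doc)  # top-level field names only
    key = hashlib.sha256(json.dumps(fields).encode("utf-8")).hexdigest()
    schemas.setdefault(key, fields)
    return {"schema": key, "values": [doc[name] for name in fields]}


def restore(stored, schemas):
    """Rebuild the original document from its schema hash and values."""
    return dict(zip(schemas[stored["schema"]], stored["values"]))


schemas = {}
a = extract_schema({"_id": "1", "name": "Ada", "zip": "12345"}, schemas)
b = extract_schema({"zip": "54321", "_id": "2", "name": "Bob"}, schemas)
```

Both documents share one schema entry, and that shared table is also the data a generated validation function or a migration tool could start from.
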
> 
> * * *
> 
> Alright, that’s it for now. Can’t wait for your feedback!
> 
> Best
> Jan
> -- 
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>