Congratulations on the 2.0 release, Jan & dev team!
I will try to start a discussion on marketing strategy for 3.0 on the marketing
list.
CouchDB is clearly a developer-centric project and has to be (individual
motivation is the driving force), but I think a marketing-oriented perspective
could be useful, especially when setting goals for the next major release.
br
Johs
> On 27. sep. 2016, at 14.56, Jan Lehnardt <[email protected]> wrote:
>
> Hi all,
>
> apologies in advance, this is going to be a long email.
>
>
> I’ve been holding this back intentionally in order to be able to focus on
> shipping 2.0, but now that that’s out, I feel we should talk about what’s
> next.
>
> This email is separated into areas of work that I think CouchDB could improve
> on, some with very concrete plans, some with rather vague ideas. I’ve been
> collecting these over the past year or <strike>two</strike>five, so it’s
> fairly wide, but I’m sure I’m missing things that other people find
> important, so please add to this list.
>
> After the initial discussion here, I’ll move all of the individual issues to
> JIRA, so we can go down our usual process.
>
> This is basically my wish list, and I’d like this to become everyone’s wish
> list, so please add what I’ve been missing. :) Note that this isn’t a
> free-for-all: only suggest things that you are prepared to see through to
> shipping, from design and implementation to docs.
>
> I don’t have a specific order for these in mind, although I have a rough idea
> of what we should be doing first. Putting all of this on a roadmap is going
> to be a fun future exercise for us, though :)
>
> One last note: this doesn’t include anything on documentation or testing. I
> fully expect us to step up our game from here on out. This list is for the
> technical aspects of the project.
>
> * * *
>
> These are the areas of work I’ve roughly come up with that my suggestions fit
> into:
>
> - API
> - Storage
> - Query
> - Replication
> - Cluster
> - Fauxton
> - Releases
> - Performance
> - Internals
> - Builds
> - Features
>
> (I’m not claiming these are any good, but it’s what I’ve got)
>
>
> Let’s go.
>
>
> * * *
>
> # API
>
> ## HTTP2
>
> I think this is an obvious first step. Our HTTP layer needs work, and our
> existing HTTP server library is not getting HTTP2 support, so it’s time to
> attack this head-first. I’m imagining a Cowboy[1]-based HTTP layer that calls
> into a unified internals layer and everything will be rose-golden. HTTP2
> support for Cowboy is still in progress. Maybe we can help them along, or we
> focus on the internals refactor first and drop Cowboy in later (not sure how
> feasible this approach is, but we’ll figure this out).
>
> In my head, we focus on this and call the result 3.0 in 6-12 months. That
> doesn’t mean we *only* do this, but this will be the focus (more on this
> later).
>
> There are a few fun considerations, mainly of the “avoid Python
> 2/3-chasm”-type. Do we re-implement the 2.0 API with all its idiosyncrasies,
> or do we take the opportunity to clean things up while we are at it? If yes,
> how and how long do we support the then old API? Do we manage this via
> different ports? If yes, how can this be made to work for hosting services
> like Cloudant? Etc. etc.
>
> [1] https://github.com/ninenines/cowboy
>
>
> ## Sub-Document Operations
>
> Currently a doc update needs the whole doc body sent to the server. There are
> some obvious performance improvements possible. For the longest time, I
> wanted to see if we can model sub-document operations via JSON Pointers[2].
> These would roughly allow pointing to a JSON value via a URL.
>
> For example in this doc:
>
> {
>   "_id": "123abc",
>   "_rev": "zyx987",
>   "contact": {
>     "name": "",
>     "address": {
>       "street": "Long Street",
>       "nr": 123,
>       "zip": "12345"
>     }
>   }
> }
>
> An update to the zip code could look like this:
>
> curl -X POST "$SERVER/db/123abc/_jsonpointer/contact/address/zip?rev=zyx987" \
>   -d '54321'
>
> GET/DELETE accordingly. We could shortcut the `_jsonpointer` to just `_` if
> we like the short magic.
>
> JSONPointer can deal with nested objects and lists and works fairly well for
> this type of stuff, and it is rather simple to implement (even I could do it:
> https://github.com/janl/erl-jsonpointer/blob/master/src/jsonpointer.erl —
> This idea is literally 5 years old, it looks like, no need to use my code if
> there is anything better).
>
> This is just a raw idea, and I’m happy to solve this any other way, if
> somebody has a good approach.
>
> [2] https://tools.ietf.org/html/rfc6901
>
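A minimal Python sketch of the sub-document update idea from the section above, assuming plain RFC 6901 semantics. The function names (`parse_pointer`, `get_value`, `set_value`) are made up for illustration and are not part of CouchDB or of the linked Erlang module; revision handling is left out entirely.

```python
import copy


def parse_pointer(pointer):
    """Split an RFC 6901 pointer such as "/contact/address/zip" into tokens."""
    if pointer == "":
        return []  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    # RFC 6901 escaping: decode "~1" to "/" first, then "~0" to "~".
    return [t.replace("~1", "/").replace("~0", "~")
            for t in pointer[1:].split("/")]


def _step(node, token):
    """Follow one token into an object (dict) or an array (list)."""
    if isinstance(node, list):
        return node[int(token)]
    return node[token]


def get_value(doc, pointer):
    """Return the value the pointer refers to."""
    node = doc
    for token in parse_pointer(pointer):
        node = _step(node, token)
    return node


def set_value(doc, pointer, value):
    """Return a copy of doc with the value at pointer replaced."""
    tokens = parse_pointer(pointer)
    if not tokens:
        return value
    new_doc = copy.deepcopy(doc)
    parent = new_doc
    for token in tokens[:-1]:
        parent = _step(parent, token)
    if isinstance(parent, list):
        parent[int(tokens[-1])] = value
    else:
        parent[tokens[-1]] = value
    return new_doc


doc = {
    "_id": "123abc",
    "_rev": "zyx987",
    "contact": {
        "name": "",
        "address": {"street": "Long Street", "nr": 123, "zip": "12345"},
    },
}
updated = set_value(doc, "/contact/address/zip", "54321")
```

The URL path after `_jsonpointer` in the curl example maps directly onto the pointer string, so the server would only need the pointer, the new value, and the revision to apply the change.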
>
> ## HTTP PATCH / JSON Diff
>
> Another stab at a similar problem is HTTP PATCH with JSON Diff, but given the
> inherent problems of JSON normalisation, I’m leaning towards the JSONPointer
> variant as the simpler one. I’d be open to this as well, though, if someone
> comes up with a good approach.
>
>
> ## GraphQL[3]
>
> It’s rather new, but getting good traction[4]. This would be a nice addition
> to our API. Somebody might already be hacking on this ;)
>
> [3]: http://graphql.org
> [4]: http://githubengineering.com/the-github-graphql-api/
>
>
> ## Mango for Document Validation
>
> The only place where we absolutely require writing JS is validate_doc_update
> functions. Some security behaviour can only be enforced there. With their
> inherent performance problems, I’d like to get doc validations out of the
> path of the query server and would love to find a way to validate document
> updates through Mango.
>
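To make the "validation through Mango" idea concrete, here is a rough Python sketch of checking an incoming document against a Mango-style selector. It only handles implicit equality, nested fields, and a handful of operators (`$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$exists`, `$in`). The functions `matches` and `validate_update` are hypothetical and do not reflect how CouchDB evaluates selectors internally.

```python
import operator

COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}
_MISSING = object()  # marks a field that is absent from the document


def matches(selector, doc):
    """Return True if every field condition in the selector holds for doc."""
    for field, condition in selector.items():
        value = doc.get(field, _MISSING) if isinstance(doc, dict) else _MISSING
        if not _check(condition, value):
            return False
    return True


def _check(condition, value):
    if not isinstance(condition, dict):
        return value == condition  # implicit equality
    if not any(key.startswith("$") for key in condition):
        # No operators: treat the condition as a selector on a nested object.
        return isinstance(value, dict) and matches(condition, value)
    for op, operand in condition.items():
        if op == "$exists":
            ok = (value is not _MISSING) == operand
        elif op == "$in":
            ok = value in operand
        elif op in COMPARATORS:
            ok = value is not _MISSING and COMPARATORS[op](value, operand)
        else:
            raise ValueError("unsupported operator: " + op)
        if not ok:
            return False
    return True


def validate_update(new_doc, rule):
    """Reject the update unless the new document satisfies the rule."""
    if not matches(rule, new_doc):
        raise ValueError("forbidden: document does not satisfy the rule")


# Hypothetical rule: contacts need a name field and a positive house number.
rule = {
    "type": "contact",
    "contact": {"name": {"$exists": True}, "address": {"nr": {"$gt": 0}}},
}
```

A declarative rule like this could be evaluated natively on the write path, which is the point of keeping validation out of the query server.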
>
> ## Redesign Security System
>
> Our security system has grown slowly over time and was never coherently
> designed. We should start over. I have many ideas and opinions, but they are
> out of scope for this. I think everybody here agrees that we can do better.
> This *very likely* will *not* include per-document ACLs as per the often
> stated issues with that approach in our data model.
>
> * * *
>
>
> # Replication
>
> This is our flagship feature of course, and there are a few things we can do
> better.
>
>
> ## Mobile-optimised extension or new version of the protocol
>
> The original protocol design didn’t take mobile devices into account and
> through PouchDB et al. we are now learning that there are a number of
> downsides to our protocol. We’ve helped a lot by introducing
> _bulk_get/_revs, but that’s more a band-aid than a considered strategy ;)
>
> That new version could also be HTTP2-only, to take advantage of the new
> connection semantics there.
>
>
> ## Easy way to skip deletes on sync
>
> This one is self-explanatory: mobile clients usually don’t need to sync
> deletes from a year ago first. Mango filters might already get us there, but
> maybe we can do better.
>
>
> ## Sync a rolling subset
>
> Say you always want to keep the last 90 days of email on a mobile device,
> optionally back-loading older documents on user request. It is something I
> could see getting a lot of traction.
>
> Today, this can be built on 1.x with clever use of _purge, but that’s hardly
> a good experience. I don’t know if it can be done in a cluster.
>
>
> ## Selective Sync
>
> There might be other criteria than “last 90 days”, so the more general
> solution to this problem class would be arbitrary (e.g. client-directed)
> selective sync. This might be really hard, as opposed to the merely very hard
> “last 90 days” case, so I’m happy to punt on this at first. But filters are
> generally not the answer, especially with large data sets. Maybe proper sync
> from views _changes is the answer.
>
>
> ## A _db_updates powered _replicator DB
>
> Running thousands of replications or more on a server is not really
> resource-friendly today. We should teach the replicator to only run
> replications on active databases via _db_updates. Somebody might already be
> looking into this one.
>
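A small Python sketch of the idea, assuming the replicator receives `_db_updates` events as dicts with `db_name` and `type` fields and keeps a map from source database to the replication jobs that read from it. The function name, the job names, and the map layout are invented for illustration.

```python
def jobs_to_run(events, jobs_by_source):
    """Yield each replication job once if its source database was updated."""
    seen = set()
    for event in events:
        if event.get("type") != "updated":
            continue  # created/deleted events need different handling
        for job in jobs_by_source.get(event["db_name"], []):
            if job not in seen:
                seen.add(job)
                yield job


# Hypothetical setup: three per-user databases, each with one replication.
jobs_by_source = {
    "userdb-alice": ["alice-to-backup"],
    "userdb-bob": ["bob-to-backup"],
    "userdb-carol": ["carol-to-backup"],
}
events = [
    {"db_name": "userdb-alice", "type": "updated"},
    {"db_name": "userdb-bob", "type": "created"},
    {"db_name": "userdb-alice", "type": "updated"},
]
active = list(jobs_to_run(events, jobs_by_source))
```

Here only the replication for `userdb-alice` is woken up; the idle databases cost nothing until they see a write.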
> * * *
>
>
> # Storage
>
>
> ## Pluggable Storage Engines
>
> Paul Davis already showed some work on allowing multiple different storage
> backends. I’d like to see this land.
>
> ## Different Storage Backends
>
> These don’t all have to be supported by the main project, but I’d really like
> to see some experimentation with different backends like
> LevelDB[5]/RocksDB[6], InnoDB[7], SQLite[8], or a native Erlang one that is
> optimised for space usage and not performance (I don’t want to budge on
> safety). Similarly, it’d be fun to see if there is a compression format that
> we can use as a storage backend directly, so we get full-DB compression as
> opposed to just per-doc compression.
>
> [5]: http://leveldb.org
> [6]: http://rocksdb.org
> [7]: https://en.wikipedia.org/wiki/InnoDB
> [8]: https://www.sqlite.org
>
> * * *
>
>
> # Query
>
> ## Teach Mango JOINs and result sorting
>
> It’s the natural path for query languages. We should make these happen. Once
> we have the basics, we might even be able to find a way to compile basic SQL
> into Mango, it’s going to be glorious :)
>
>
> ## “No-JavaScript”-mode
>
> I’ve hinted at this above, but I’d really like a way for users to use CouchDB
> productively without having to write a line of JavaScript. My main motivation
> is the poor performance characteristics of the Query Server (hello CGI[9]?).
> But even with one that is improved, it will always be faster to do, say,
> filtering or validation operations in native Erlang. I don’t know if we can
> expand Mango to cover all this, and I’m not really concerned about the
> specifics, as long as we get there.
>
> Of course, for pro-users, the JS-variant will still be around.
>
> [9]: https://en.wikipedia.org/wiki/Common_Gateway_Interface
>
>
> ## Query Server V2
>
> We need to revamp the Query Server. It is hardcoded to an out-of-date version
> of SpiderMonkey and we are stuck with C-bindings that barely anyone dares to
> look at, let alone iterate on.
>
> I believe the way forward is revamping the query server protocol to use
> streaming IO instead of blocking batches like we do now, and using a JS-native
> implementation of the JS side instead of C-bindings.
>
> I’m partial to doing this straight in Node, because there is a ton of support
> for things we need already, and I believe we’ve solved the isolation issues
> required for secure MapReduce, but I’m happy to use any other thing as well,
> if it helps.
>
> Other benefits would be support for emerging JS features that devs will want
> to use.
>
> And we can have two modes: standalone QS like now, and embedded QS where,
> say, V8 is compiled into the Erlang VM. Not everybody will want to run this,
> but it’ll be neat for those who do.
>
>
> * * *
>
>
> # Cluster
>
> ## Rebalancing
>
> With this we will be able to grow clusters one by one instead of hitting a
> wall when eventually each shard lives on a single machine. E.g. when you add
> a node to the cluster, all other nodes share 1/Nth of their data with the new
> node, and everything can keep going. Same for removing a node and shrinking
> the cluster.
>
> Couchbase has this and it is really nice.
>
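The "every node hands over 1/Nth of its data" step can be sketched in Python as follows. This is a deliberately naive illustration: it treats shards as opaque names, ignores replica placement, and the function `add_node` and the shard names are assumptions, not anything from CouchDB.

```python
def add_node(assignment, new_node):
    """Return a new node -> shards mapping with shards moved to new_node.

    assignment maps each existing node name to a list of shard names.
    """
    result = {node: list(shards) for node, shards in assignment.items()}
    result[new_node] = []
    total = sum(len(shards) for shards in assignment.values())
    target = total // len(result)  # fair share for the new node
    while len(result[new_node]) < target:
        # Always take from the currently fullest existing node.
        donor = max(assignment, key=lambda node: len(result[node]))
        result[new_node].append(result[donor].pop())
    return result


before = {
    "node1": ["s0", "s1", "s2", "s3"],
    "node2": ["s4", "s5", "s6", "s7"],
    "node3": ["s8", "s9", "s10", "s11"],
}
after = add_node(before, "node4")
```

With twelve shards on three nodes, adding a fourth node moves one shard from each existing node, so every node ends up with three.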
>
> ## Setup
>
> Even without rebalancing, we need a nice Fauxton UI to manage the cluster. So
> far we only have a simple setup procedure (which is great, don’t get me
> wrong), but users will want to do more elaborate cluster management and we
> should make that easy with a slick UI.
>
>
> ## Cluster-Aware Clients
>
> This might end up being not a good idea, but I’d like some experimentation
> here. Say you’d have a CouchDB client that could be hooked into the cluster
> topology so it’d know which nodes to query for which data, then we can save a
> proxy-hop, and build clients that have lower-latency access to CouchDB.
> Again, this is something that Couchbase does and I think is worth exploring.
>
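As a rough Python illustration of a cluster-aware client: the client hashes the document ID onto a shard range and talks to a node holding that range directly. The CRC32 hash, the equal-sized ranges, and the `shard_map` layout are assumptions made for this sketch; a real client would have to fetch the actual topology and hashing scheme from the cluster.

```python
import zlib


def shard_index(doc_id, q):
    """Map a document ID onto one of q equal ranges of the 32-bit hash space."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # 0 <= h < 2**32 in Python 3
    return h * q // 2**32


def nodes_for(doc_id, shard_map):
    """shard_map is a list with one entry per shard range: the nodes holding it."""
    return shard_map[shard_index(doc_id, len(shard_map))]


# Hypothetical topology: four shard ranges, two copies of each.
shard_map = [
    ["node1", "node2"],
    ["node2", "node3"],
    ["node3", "node1"],
    ["node1", "node3"],
]
candidates = nodes_for("123abc", shard_map)
```

The client can then send the request to one of `candidates` and skip the proxy hop through a coordinating node.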
>
>
> * * *
>
>
> # Fauxton
>
> Fauxton is great, but it could be better too, I think. I’m mostly concerned
> about the number of clicks/taps required for more specialised actions (like
> setting the group_level of a reduce query, it’s like 15 or so). More cluster
> info would also be nice, and maybe a specialised dashboard for db-per-user
> setups.
>
>
> * * *
>
>
> # Releases
>
>
> ## Six-Week Release Trains
>
> We need to get back to frequent releases and I propose to go back to our
> six-week-release train plans from three years ago. Whatever lands within a
> release train time frame goes out. The nature of the change dictates the
> version number increment as per semver, and we just ship a new version every
> six weeks, even if it only includes a single bug fix. We should automate most
> of this infrastructure, so actual releases are cheap. We are reasonably close
> with this, but we need some more folks to step up on using and maintaining
> our CI systems.
>
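How the nature of the changes picks the next version number can be sketched in a few lines of Python. The labels `"breaking"`, `"feature"`, and `"fix"` and the function `next_version` are made up for this example; the bump rules themselves follow semver.

```python
def next_version(current, changes):
    """Return the next semver string for a release train.

    changes is an iterable of labels, one per change that landed:
    "breaking", "feature", or "fix".
    """
    major, minor, patch = (int(part) for part in current.split("."))
    kinds = set(changes)
    if "breaking" in kinds:
        return f"{major + 1}.0.0"
    if "feature" in kinds:
        return f"{major}.{minor + 1}.0"
    if "fix" in kinds:
        return f"{major}.{minor}.{patch + 1}"
    return current  # nothing landed, nothing to ship


train = next_version("2.0.0", ["fix", "feature", "fix"])
```

A train with a single bug fix still produces a release (a patch bump), which matches the "ship every six weeks regardless" idea.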
>
> ## One major feature per major version
>
> I also propose to keep the scope of future major versions small, so we don’t
> have to wait another 3-5 years for 3.0. In particular, I think we should
> focus on a single major feature per major version and get that shipped within
> 6-12 months tops. If anything needs more time, it needs to be broken up. Of
> course we continue to add features and fix things while this happens, but as
> a project, there is *one* major feature we push. For example, for 3.0 I see
> our push being behind HTTP2 support. There is a lot of subsequent work required
> to make that happen, so it’ll be a worthwhile 3.0, but we can ship it in 6-12
> months (hopefully).
>
> Best case scenario, we have CouchDB 4.0 coming out 12 months from now with
> two new major features. That would be amazing.
>
>
> * * *
>
>
> # Performance
>
> ## Perf Team
>
> We need a team to take a comprehensive look at CouchDB performance. There is
> a lot of low-hanging fruit, as Robert Kowalski showed a while back, and we
> should get back into this. I’m mostly inspired by SQLite, who did a release a
> while back that only focussed on 1-2% performance improvements, but got like
> 20-30 of those and made the thing a lot faster across the board. I can’t
> remember where I read about this, but I’ll update this once I find the link.
>
>
> ## Benchmark Suite
>
> We need a benchmark suite that tests a variety of different work loads. The
> goal here is to run different versions of CouchDB against the same suite on
> the same hardware, to see where we are going. I’m imagining a
> http://arewefastyet.com style dashboard where we can track this, and even run
> this on Pull Requests and not allow them if they significantly impact
> performance.
>
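The Pull Request gate could be as simple as comparing two sets of timings. The Python sketch below assumes each benchmark run yields a dict of workload name to seconds (lower is better); the 5% tolerance, the workload names, and the function `regressions` are placeholders, not an agreed policy.

```python
def regressions(baseline, candidate, tolerance=0.05):
    """Return workloads that got slower by more than the tolerance.

    The result maps workload name to relative slowdown (0.15 means 15%).
    """
    slower = {}
    for name, base in baseline.items():
        new = candidate.get(name)
        if new is not None and new > base * (1 + tolerance):
            slower[name] = (new - base) / base
    return slower


# Made-up timings in seconds for the main branch and a pull request.
baseline = {"bulk_insert": 10.0, "view_build": 20.0}
candidate = {"bulk_insert": 10.2, "view_build": 23.0}
failed = regressions(baseline, candidate)
```

CI would block the Pull Request whenever `failed` is non-empty and report the offending workloads.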
>
> ## Synthetic Load Suite
>
> This one is for end users. I’d like to be able to say: My app produces mostly
> 10-20kb-sized docs, but millions of those in a single database, or across
> 1000s of databases, with these views etc. and then run this on target
> hardware so I’d know, e.g. how many nodes I need for a cluster with my
> estimated workload. I know this can only be done in approximation, but I
> think this could make a big difference in CouchDB adoption and feed back into
> the Perf Team mentioned above.
>
> * * *
>
>
> # Internals
>
> ## Consolidate Repositories
>
> With 2.0 we started to experiment with radically small modules for our
> components and I think we’ve come to the conclusion that some consolidation
> is better for us going forward. Obvious candidates for separate repos are
> docs, Fauxton etc. but also some of the Erlang modules that other projects
> reasonably would use.
>
>
> ## Elixir
>
> I’d like it very much if we elevate Elixir as a prime target language for
> writing CouchDB internals. I believe this would get us an influx of new
> developers that we badly need to get all the things I’m listing here done.
> Somebody might be looking into the technical aspects of this already, but we
> need to decide as a project if we are okay with that.
>
>
> ## GitHub Issues
>
> I hope we can transition to GitHub Issues soon.
>
> * * *
>
>
> # Builds
>
> I’d like automated builds for source, Docker et al., rpm, deb, brew, ports,
> Mac Binary, etc with proper release channels for people to subscribe to, all
> powered by CI for nightly builds, so people can test in-development versions
> easily.
>
> I’d also like builds that include popular community plugins like Geo or
> Fulltext Search.
>
>
>
> * * *
>
>
> # Features
>
> ## Better Support for db-per-user
>
> I don’t know what this will look like, but this is a pattern, and we need to
> support it better.
>
> One approach could be “virtual dbs” that are backed by a single database, but
> that’s usually at odds with views, so we could make this an XOR and disable
> views on these dbs. Since this usually powers client-heavy apps, querying
> usually happens there anyway.
>
> Another approach would be better / easier cross-db aggregation or querying.
> There are a few approaches, but nothing really slick.
>
>
> ## Schema Extraction
>
> I have half an (old) patch that extracts top level fields from a document and
> stores them with a hash in an “attachment” to the database header. So we only
> end up storing doc values and the schema hash. First of all this trades
> storage for CPU time (I haven’t measured anything yet), but more
> interestingly, we could use that schema data to do smart things like
> auto-generating a validation function / mango expression based on the data
> that is already in the database. And other fun things like easier schema
> migration operations that are native in CouchDB and thus a lot faster than
> external ones. For the curious ones, I got the idea from V8’s property
> access optimisation strategy[10].
>
> [10]: https://github.com/v8/v8/wiki/Design%20Elements#fast-property-access
>
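A toy Python version of the schema extraction idea, to show the storage trade-off: the list of top-level field names is stored once under its hash, and each document keeps only the hash plus its values. The SHA-256 hash, the `SchemaStore` class, and the in-memory dict standing in for the database header are all assumptions for illustration; the original patch may work quite differently.

```python
import hashlib
import json


def split_schema(doc):
    """Separate a document into (schema_hash, field names, values)."""
    fields = sorted(doc)  # top-level field names in a stable order
    digest = hashlib.sha256(json.dumps(fields).encode("utf-8")).hexdigest()
    values = [doc[field] for field in fields]
    return digest, fields, values


class SchemaStore:
    """Keeps each distinct schema once and packs docs as hash + values."""

    def __init__(self):
        self.schemas = {}  # schema hash -> list of field names

    def pack(self, doc):
        digest, fields, values = split_schema(doc)
        self.schemas.setdefault(digest, fields)
        return {"schema": digest, "values": values}

    def unpack(self, packed):
        fields = self.schemas[packed["schema"]]
        return dict(zip(fields, packed["values"]))


store = SchemaStore()
a = store.pack({"_id": "1", "type": "contact", "name": "Ada"})
b = store.pack({"name": "Bob", "_id": "2", "type": "contact"})
```

Both documents share one schema entry, and the set of known schemas is exactly the data one would need to auto-generate a validation rule or drive a migration.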
> * * *
>
> Alright, that’s it for now. Can’t wait for your feedback!
>
> Best
> Jan
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>