> On 29 Sep 2016, at 08:42, Reddy B. wrote:
>
> Jan and Paul,
>
> Thanks for all this insight, this is awesome.
>
> > couch-plugins
>
> This is it! Looks very exciting, I'll absolutely start looking into the
> source code this week.

Fantastic, thank you, I’d love to see this revived. The strong consistency
thing that Nick mentioned might play into this, as we’d want that for a
cluster-wide configuration system, and plugins could probably very nicely
make use of that.

Best
Jan

--
>
>
> On 28/09/2016 21:49, Paul Davis wrote:
>> I definitely screwed up on the mrview/index split in hindsight.
>> couch_index should have been a library app that indexers could use as
>> a toolbox rather than a weirdo plugin/callback system like it is now.
>> Live and learn!
>>
>> For the extensibility aspect, once we get a solid abstraction like
>> we've been talking about for the HTTP API it seems like this sort of
>> thing would be a lot easier insofar as it'd be standard Erlang
>> procedure for reusing applications. And we could look into having
>> releases of components available on hex.pm.
>>
>> On Wed, Sep 28, 2016 at 1:26 PM, Robert Samuel Newson wrote:
>>> Hi,
>>>
>>> We can certainly do better on this front. I will say that the (now
>>> venerable) couchdb-lucene project had no problem extending couchdb into
>>> full-search capability without source modifications.
>>>
>>> In 2.0, it's true that we've made things harder to plug in. The couch_epi
>>> application is our general answer here, it allows a programmatic override
>>> to various places, and we can expand on those hook points easily enough. It
>>> does mean writing Erlang code, though.
>>>
>>> When we talk about switching from mochiweb to cowboy, we gain another
>>> possibility to allow extensions through cowboy middleware.
>>>
>>> To truly make couchdb extensible/pluggable to the degree you seem to be
>>> asking for would be more work than that, I think. Under the covers, of
>>> course, couchdb is already composed of a large set of independent processes
>>> that communicate with each other using messages.
>>>
>>> The couch_index/couch_mrview split from years back was specifically to
>>> allow for new index types to be added smoothly (geocouch was the motivating
>>> case, in fact). It's fair to say that it did not pan out, but other
>>> approaches could.
>>>
>>> I think it best not to raise the specter of COM (or CORBA), the details of
>>> that distract from the intention here. What you seek is a more composable
>>> approach, where you could assemble a system of couchdb components and
>>> custom components?
>>>
>>> It might help at this point to hear some more examples of the extensions
>>> you didn't feel able to make.
>>>
>>> B.
>>>
>>>> On 28 Sep 2016, at 13:04, Reddy B. wrote:
>>>>
>>>> I've been very busy with work for one month only and when I catch up 2.0
>>>> is out and you're even talking about 3.0, congratulations.
>>>>
>>>> I'd like to contribute to this list, I've not read the source code of
>>>> CouchDB yet so I can't be too precise but as the head of development of
>>>> several companies, I thought my proposition could be valuable.
>>>>
>>>> The one big regret I have with CouchDB is the difficulty of extending it.
>>>> Namely the necessity to rebuild CouchDB from sources to add things such as
>>>> Lucene, or even GeoCouch. To take our example, we would have contributed a
>>>> number of extensions to CouchDB already if it wasn't for that. Perhaps
>>>> it's just me, but there really is a psychological threshold to pass to get
>>>> into building a third-party project, and another one to get into forking
>>>> it. I personally don't know if I'll ever get around it, because there's
>>>> too much cost and maintenance requirements involved.
>>>>
>>>> I'm not sure exactly what the limitation is and if this is achievable, but
>>>> some sort of language-agnostic plugin architecture/extensibility pipeline
>>>> would be absolutely great and in my opinion can be an interesting priority
>>>> for a version 3.0, as it would dramatically help boost the number of
>>>> contributions to the CouchDB ecosystem. I'm not sure I have the
>>>> terminology right, but it might all come down to making the creation of
>>>> custom indexes rebuild-free and language-agnostic. I'm thinking of
>>>> something along the lines of COM APIs
>>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/ms680573%28v=vs.85%29.aspx>.
>>>>
>>>> If you find the idea interesting, I'd be happy to start getting my hands
>>>> dirty and work on it.
>>>>
>>>>
>>>> On 27/09/2016 14:56, Jan Lehnardt wrote:
>>>>> Hi all,
>>>>>
>>>>> apologies in advance, this is going to be a long email.
>>>>>
>>>>>
>>>>> I’ve been holding this back intentionally in order to be able to focus on
>>>>> shipping 2.0, but now that that’s out, I feel we should talk about what’s
>>>>> next.
>>>>>
>>>>> This email is separated into areas of work that I think CouchDB could
>>>>> improve on, some with very concrete plans, some with rather vague ideas.
>>>>> I’ve been collecting these over the past year or
>>>>> <strike>two</strike>five, so it’s fairly wide, but I’m sure I’m missing
>>>>> things that other people find important, so please add to this list.
>>>>>
>>>>> After the initial discussion here, I’ll move all of the individual issues
>>>>> to JIRA, so we can go through our usual process.
>>>>>
>>>>> This is basically my wish list, and I’d like this to become everyone’s
>>>>> wish list, so please add what I’ve been missing. :) Note, this isn’t a
>>>>> free-for-all, only suggest things that you are prepared to see through
>>>>> to being shipped, from design and implementation to docs.
>>>>>
>>>>> I don’t have a specific order for these in mind, although I have a rough
>>>>> idea of what we should be doing first. Putting all of this on a roadmap
>>>>> is going to be a fun future exercise for us, though :)
>>>>>
>>>>> One last note: this doesn’t include anything on documentation or testing.
>>>>> I fully expect us to step up our game from here on out. This list is for
>>>>> the technical aspects of the project.
>>>>>
>>>>> * * *
>>>>>
>>>>> These are the areas of work I’ve roughly come up with that my suggestions
>>>>> fit into:
>>>>>
>>>>> - API
>>>>> - Storage
>>>>> - Query
>>>>> - Replication
>>>>> - Cluster
>>>>> - Fauxton
>>>>> - Releases
>>>>> - Performance
>>>>> - Internals
>>>>> - Builds
>>>>> - Features
>>>>>
>>>>> (I’m not claiming these are any good, but it’s what I’ve got)
>>>>>
>>>>>
>>>>> Let’s go.
>>>>>
>>>>>
>>>>> * * *
>>>>>
>>>>> # API
>>>>>
>>>>> ## HTTP2
>>>>>
>>>>> I think this is an obvious first next step. Our HTTP Layer needs work,
>>>>> our existing HTTP server library is not getting HTTP2 support, it’s time
>>>>> to attack this head-first. I’m imagining a Cowboy[1]-based HTTP layer
>>>>> that calls into a unified internals layer and everything will be
>>>>> rose-golden. HTTP2 support for Cowboy is still in progress. Maybe we can
>>>>> help them along, or we focus on the internals refactor first and drop
>>>>> Cowboy in later (not sure how feasible this approach is, but we’ll figure
>>>>> this out).
>>>>>
>>>>> In my head, we focus on this and call the result 3.0 in 6-12 months. That
>>>>> doesn’t mean we *only* do this, but this will be the focus (more on this
>>>>> later).
>>>>>
>>>>> There are a few fun considerations, mainly of the “avoid Python
>>>>> 2/3-chasm”-type. Do we re-implement the 2.0 API with all its
>>>>> idiosyncrasies, or do we take the opportunity to clean things up while we
>>>>> are at it? If yes, how and how long do we support the then old API? Do we
>>>>> manage this via different ports?
>>>>> If yes, how can this be made to work for
>>>>> hosting services like Cloudant? Etc. etc.
>>>>>
>>>>> [1] https://github.com/ninenines/cowboy
>>>>>
>>>>>
>>>>> ## Sub-Document Operations
>>>>>
>>>>> Currently a doc update needs the whole doc body sent to the server. There
>>>>> are some obvious performance improvements possible. For the longest time,
>>>>> I wanted to see if we can model sub-document operations via JSON
>>>>> Pointers[2]. These would roughly allow pointing to a JSON value via a URL.
>>>>>
>>>>> For example in this doc:
>>>>>
>>>>>     {
>>>>>       "_id": "123abc",
>>>>>       "_rev": "zyx987",
>>>>>       "contact": {
>>>>>         "name": "",
>>>>>         "address": {
>>>>>           "street": "Long Street",
>>>>>           "nr": 123,
>>>>>           "zip": "12345"
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>
>>>>> An update to the zip code could look like this:
>>>>>
>>>>>     curl -X POST \
>>>>>       "$SERVER/db/123abc/_jsonpointer/contact/address/zip?rev=zyx987" -d '54321'
>>>>>
>>>>> GET/DELETE accordingly. We could shortcut the `_jsonpointer` to just `_`
>>>>> if we like the short magic.
>>>>>
>>>>> JSONPointer can deal with nested objects and lists and works fairly well
>>>>> for this type of stuff, and it is rather simple to implement (even I
>>>>> could do it:
>>>>> https://github.com/janl/erl-jsonpointer/blob/master/src/jsonpointer.erl -
>>>>> This idea is literally 5 years old, it looks like, no need to use my code
>>>>> if there is anything better).
>>>>>
>>>>> This is just a raw idea, and I’m happy to solve this any other way, if
>>>>> somebody has a good approach.
>>>>>
>>>>> [2] https://tools.ietf.org/html/rfc6901
>>>>>
>>>>>
>>>>> ## HTTP PATCH / JSON Diff
>>>>>
>>>>> Another stab at a similar problem is HTTP PATCH with JSON Diff, but with
>>>>> the inherent problems of JSON normalisation, I’m leaning towards the
>>>>> JSONPointer variant as simpler, but I’d be open to this as well, if
>>>>> someone comes up with a good approach.
>>>>>
>>>>>
>>>>> ## GraphQL[3]
>>>>>
>>>>> It’s rather new, but getting good traction[4].
>>>>> This would be a nice
>>>>> addition to our API. Somebody might already be hacking on this ;)
>>>>>
>>>>> [3]: http://graphql.org
>>>>> [4]: http://githubengineering.com/the-github-graphql-api/
>>>>>
>>>>>
>>>>> ## Mango for Document Validation
>>>>>
>>>>> The only place where we absolutely require writing JS is
>>>>> validate_doc_update functions. Some security behaviour can only be
>>>>> enforced there. With their inherent performance problems, I’d like to get
>>>>> doc validations out of the path of the query server and would love to
>>>>> find a way to validate document updates through Mango.
>>>>>
>>>>>
>>>>> ## Redesign Security System
>>>>>
>>>>> Our security system has grown slowly and was not coherently designed. We
>>>>> should start over. I have many ideas and opinions, but they are out of
>>>>> scope for this. I think everybody here agrees that we can do better. This
>>>>> *very likely* will *not* include per-document ACLs as per the often
>>>>> stated issues with that approach in our data model.
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Replication
>>>>>
>>>>> This is our flagship feature of course, and there are a few things we can
>>>>> do better.
>>>>>
>>>>>
>>>>> ## Mobile-optimised extension or new version of the protocol
>>>>>
>>>>> The original protocol design didn’t take mobile devices into account and
>>>>> through PouchDB et al. we are now learning that there are a number of
>>>>> downsides to our protocol. We’ve helped a lot with introducing
>>>>> _bulk_get/_revs, but that’s more a bandaid than a considered strategy ;)
>>>>>
>>>>> That new version could also be HTTP2-only, to take advantage of the new
>>>>> connection semantics there.
>>>>>
>>>>>
>>>>> ## Easy way to skip deletes on sync
>>>>>
>>>>> This one is self-explanatory, mobile clients usually don’t need to sync
>>>>> deletes from a year ago first. Mango filters might already get us there,
>>>>> maybe we can do better.
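
To make the "Mango filters might already get us there" remark concrete, here is a rough sketch of a replication document whose Mango `selector` only matches documents without a `_deleted` field. CouchDB's replicator accepts a `selector` on replication documents; that matching on `_deleted` skips tombstones in every version is an assumption worth verifying. The database URLs and the helper names `build_replication_doc` and `is_live` are made up for illustration, and `is_live` only mimics the selector locally so the intent is easy to see.

```python
import json

# Hypothetical selector: match only documents that carry no `_deleted` field.
# Assumption: the replicator evaluates `_deleted` like any other field.
SKIP_DELETES_SELECTOR = {"_deleted": {"$exists": False}}


def build_replication_doc(source, target):
    """Return the body of a `_replicator` document that uses the selector."""
    return {
        "source": source,
        "target": target,
        "continuous": True,
        "selector": SKIP_DELETES_SELECTOR,
    }


def is_live(doc):
    """Local stand-in for the selector: True when the doc has no `_deleted`."""
    return "_deleted" not in doc


if __name__ == "__main__":
    # Placeholder URLs, not real endpoints.
    doc = build_replication_doc(
        "http://localhost:5984/mail",
        "http://localhost:5984/mail-mobile",
    )
    print(json.dumps(doc, indent=2))
```

Note that a filter like this only keeps tombstones from being copied. A client that already holds a document would never learn that it was deleted, which is probably part of why a dedicated mechanism could do better.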
>>>>>
>>>>>
>>>>> ## Sync a rolling subset
>>>>>
>>>>> Say you always want to keep the last 90 days of email on a mobile device
>>>>> with optionally back-loading older documents on user-request. It is
>>>>> something I could see getting a lot of traction.
>>>>>
>>>>> Today, this can be built on 1.x with clever use of _purge, but that’s
>>>>> hardly a good experience. I don’t know if it can be done in a cluster.
>>>>>
>>>>>
>>>>> ## Selective Sync
>>>>>
>>>>> There might be other criteria than “last 90 days”, so the more general
>>>>> solution to this problem class would be arbitrary (e.g. client-directed)
>>>>> selective sync, but this might be really hard, as opposed to the merely
>>>>> very hard “last 90 days” one, so happy to punt on this first. But
>>>>> filters are generally not the answer, especially with large data sets.
>>>>> Maybe proper sync from views _changes is the answer.
>>>>>
>>>>>
>>>>> ## A _db_updates powered _replicator DB
>>>>>
>>>>> Running thousands+ of replications on a server is not really resource
>>>>> friendly today, we should teach the replicator to only run replication on
>>>>> active databases via _db_updates. Somebody might already be looking into
>>>>> this one.
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Storage
>>>>>
>>>>>
>>>>> ## Pluggable Storage Engines
>>>>>
>>>>> Paul Davis already showed some work on allowing multiple different
>>>>> storage backends. I’d like to see this land.
>>>>>
>>>>> ## Different Storage Backends
>>>>>
>>>>> These don’t all have to be supported by the main project, but I’d really
>>>>> like to see some experimentation with different backends like
>>>>> LevelDB[5]/RocksDB[6], InnoDB[7], SQLite[8], or a native-Erlang one that
>>>>> is optimised for space usage and not performance (I don’t want to budge
>>>>> on safety). Similarly, it’d be fun to see if there is a compression format
>>>>> that we can use as a storage backend directly, so we get full-DB
>>>>> compression as opposed to just per-doc compression.
>>>>>
>>>>> [5]: http://leveldb.org
>>>>> [6]: http://rocksdb.org
>>>>> [7]: https://en.wikipedia.org/wiki/InnoDB
>>>>> [8]: https://www.sqlite.org
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Query
>>>>>
>>>>> ## Teach Mango JOINs and result sorting
>>>>>
>>>>> It’s the natural path for query languages. We should make these happen.
>>>>> Once we have the basics, we might even be able to find a way to compile
>>>>> basic SQL into Mango, it’s going to be glorious :)
>>>>>
>>>>>
>>>>> ## “No-JavaScript”-mode
>>>>>
>>>>> I’ve hinted at this above, but I’d really like a way for users to use
>>>>> CouchDB productively without having to write a line of JavaScript. My
>>>>> main motivation is the poor performance characteristics of the Query
>>>>> Server (hello CGI[9]?). But even with one that is improved, it will
>>>>> always be faster to do, say, filtering or validation operations in native
>>>>> Erlang. I don’t know if we can expand Mango to cover all this, and I’m
>>>>> not really concerned about the specifics, as long as we get there.
>>>>>
>>>>> Of course, for pro-users, the JS-variant will still be around.
>>>>>
>>>>> [9]: https://en.wikipedia.org/wiki/Common_Gateway_Interface
>>>>>
>>>>>
>>>>> ## Query Server V2
>>>>>
>>>>> We need to revamp the Query Server. It is hardcoded to an out-of-date
>>>>> version of SpiderMonkey and we are stuck with C-bindings that barely
>>>>> anyone dares to look at, let alone iterate on.
>>>>>
>>>>> I believe the way forward is revamping the query server protocol to use
>>>>> streaming IO instead of blocking batches like we do now, and to use a
>>>>> JS-native implementation of the JS-side instead of C-bindings.
>>>>>
>>>>> I’m partial to doing this straight in Node, because there is a ton of
>>>>> support for things we need already, and I believe we’ve solved the
>>>>> isolation issues required for secure MapReduce, but I’m happy to use any
>>>>> other thing as well, if it helps.
>>>>>
>>>>> Other benefits would be support for emerging JS features that devs will
>>>>> want to use.
>>>>>
>>>>> And we can have two modes: standalone QS like now, and embedded QS where,
>>>>> say, V8 is compiled into the Erlang VM. Not everybody will want to run
>>>>> this, but it’ll be neat for those who do.
>>>>>
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Cluster
>>>>>
>>>>> ## Rebalancing
>>>>>
>>>>> With this we will be able to grow clusters one by one instead of hitting
>>>>> a wall when eventually each shard lives on a single machine. E.g. when
>>>>> you add a node to the cluster, all other nodes share 1/Nth of their data
>>>>> with the new node, and everything can keep going. Same for removing a
>>>>> node and shrinking the cluster.
>>>>>
>>>>> Couchbase has this and it is really nice.
>>>>>
>>>>>
>>>>> ## Setup
>>>>>
>>>>> Even without rebalancing, we need a nice Fauxton UI to manage the
>>>>> cluster, so far we only have a simple setup procedure (which is great,
>>>>> don’t get me wrong), but users will want to do more elaborate cluster
>>>>> management and we should make that easy with a slick UI.
>>>>>
>>>>>
>>>>> ## Cluster-Aware Clients
>>>>>
>>>>> This might end up being not a good idea, but I’d like some
>>>>> experimentation here. Say you’d have a CouchDB client that could be
>>>>> hooked into the cluster topology so it’d know which nodes to query for
>>>>> which data, then we can save a proxy-hop, and build clients that have
>>>>> lower-latency access to CouchDB. Again, this is something that Couchbase
>>>>> does and I think is worth exploring.
>>>>>
>>>>>
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Fauxton
>>>>>
>>>>> Fauxton is great, but it could be better too, I think. I’m mostly
>>>>> concerned about number of clicks/taps required for more specialised
>>>>> actions (like setting the group_level of a reduce query, it’s like 15 or
>>>>> so). More cluster info would also be nice, and maybe a specialised
>>>>> dashboard for db-per-user setups.
>>>>>
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Releases
>>>>>
>>>>>
>>>>> ## Six-Week Release Trains
>>>>>
>>>>> We need to get back to frequent releases and I propose to go back to our
>>>>> six-week-release train plans from three years ago. Whatever lands within
>>>>> a release train time frame goes out. The nature of the change dictates
>>>>> the version number increment as per semver, and we just ship a new
>>>>> version every six weeks, even if it only includes a single bug fix. We
>>>>> should automate most of this infrastructure, so actual releases are
>>>>> cheap. We are reasonably close with this, but we need some more folks to
>>>>> step up on using and maintaining our CI systems.
>>>>>
>>>>>
>>>>> ## One major feature per major version
>>>>>
>>>>> I also propose to keep the scope of future major versions small, so we
>>>>> don’t have to wait another 3-5 years for 3.0. In particular, I think we
>>>>> should focus on a single major feature per major version and get that
>>>>> shipped within 6-12 months tops. If anything needs more time, it needs to
>>>>> be broken up. Of course we continue to add features and fix things while
>>>>> this happens, but as a project, there is *one* major feature we push. For
>>>>> example, for 3.0 I see our push being behind HTTP2 support. There is a lot
>>>>> of subsequent work required to make that happen, so it’ll be a worthwhile
>>>>> 3.0, but we can ship it in 6-12 months (hopefully).
>>>>>
>>>>> Best case scenario, we have CouchDB 4.0 coming out 12 months from now
>>>>> with two new major features. That would be amazing.
>>>>>
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Performance
>>>>>
>>>>> ## Perf Team
>>>>>
>>>>> We need a team to take a comprehensive look at CouchDB performance. There
>>>>> is a lot of low-hanging fruit like Robert Kowalski showed a while back, we
>>>>> should get back into this.
>>>>> I’m mostly inspired by SQLite who’ve done a
>>>>> release a while back that only focussed on 1-2% performance improvements,
>>>>> but got like 20-30 of those and made the thing a lot faster across the
>>>>> board. I can’t remember where I read about this, but I’ll update this
>>>>> once I find the link.
>>>>>
>>>>>
>>>>> ## Benchmark Suite
>>>>>
>>>>> We need a benchmark suite that tests a variety of different workloads.
>>>>> The goal here is to run different versions of CouchDB against the same
>>>>> suite on the same hardware, to see where we are going. I’m imagining a
>>>>> http://arewefastyet.com style dashboard where we can track this, and even
>>>>> run this on Pull Requests and not allow them if they significantly impact
>>>>> performance.
>>>>>
>>>>>
>>>>> ## Synthetic Load Suite
>>>>>
>>>>> This one is for end users. I’d like to be able to say: My app produces
>>>>> mostly 10-20kb-sized docs, but millions of those in a single database, or
>>>>> across 1000s of databases, with these views etc. and then run this on
>>>>> target hardware so I’d know, e.g. how many nodes I need for a cluster
>>>>> with my estimated workload. I know this can only be done in
>>>>> approximation, but I think this could make a big difference in CouchDB
>>>>> adoption and feed back into the Perf Team mentioned above.
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Internals
>>>>>
>>>>> ## Consolidate Repositories
>>>>>
>>>>> With 2.0 we started to experiment with radically small modules for our
>>>>> components and I think we’ve come to the conclusion that some
>>>>> consolidation is better for us going forward. Obvious candidates for
>>>>> separate repos are docs, Fauxton etc. but also some of the Erlang modules
>>>>> that other projects reasonably would use.
>>>>>
>>>>>
>>>>> ## Elixir
>>>>>
>>>>> I’d like it very much if we elevate Elixir as a prime target language for
>>>>> writing CouchDB internals.
>>>>> I believe this would get us an influx of new
>>>>> developers that we badly need to get all the things I’m listing here
>>>>> done. Somebody might be looking into the technical aspects of this
>>>>> already, but we need to decide as a project if we are okay with that.
>>>>>
>>>>>
>>>>> ## GitHub Issues
>>>>>
>>>>> I hope we can transition to GitHub Issues soon.
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Builds
>>>>>
>>>>> I’d like automated builds for source, Docker et al., rpm, deb, brew,
>>>>> ports, Mac Binary, etc. with proper release channels for people to
>>>>> subscribe to, all powered by CI for nightly builds, so people can test
>>>>> in-development versions easily.
>>>>>
>>>>> I’d also like builds that include popular community plugins like Geo or
>>>>> Fulltext Search.
>>>>>
>>>>>
>>>>>
>>>>> * * *
>>>>>
>>>>>
>>>>> # Features
>>>>>
>>>>> ## Better Support for db-per-user
>>>>>
>>>>> I don’t know what this will look like, but this is a pattern, and we need
>>>>> to support it better.
>>>>>
>>>>> One approach could be “virtual dbs” that are backed by a single database,
>>>>> but that’s usually at odds with views, so we could make this an XOR and
>>>>> disable views on these dbs. Since this usually powers client-heavy apps,
>>>>> querying usually happens there anyway.
>>>>>
>>>>> Another approach would be better / easier cross-db aggregation or
>>>>> querying. There are a few approaches, but nothing really slick.
>>>>>
>>>>>
>>>>> ## Schema Extraction
>>>>>
>>>>> I have half an (old) patch that extracts top level fields from a document
>>>>> and stores them with a hash in an “attachment” to the database header. So
>>>>> we only end up storing doc values and the schema hash. First of all this
>>>>> trades storage for CPU time (I haven’t measured anything yet), but more
>>>>> interestingly, we could use that schema data to do smart things like
>>>>> auto-generating a validation function / mango expression based on the
>>>>> data that is already in the database.
>>>>> And other fun things like easier
>>>>> schema migration operations that are native in CouchDB and thus a lot
>>>>> faster than external ones. For the curious ones, I’ve got the idea from
>>>>> V8’s property access optimisation strategy[10].
>>>>>
>>>>> [10]: https://github.com/v8/v8/wiki/Design%20Elements#fast-property-access
>>>>>
>>>>> * * *
>>>>>
>>>>> Alright, that’s it for now. Can’t wait for your feedback!
>>>>>
>>>>> Best
>>>>> Jan
>

--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
