> On 28 Sep 2016, at 21:19, Reddy B. <[email protected]> wrote:
>
> Hi,
>
> Many thanks for your comprehensive answer. Indeed what I seek is the more
> composable approach you describe.
>
> Thanks, fortunately I have nothing for COM in particular, it was just an
> attempt to illustrate my point in case it wasn't clear enough.
>
> From what I understand in your message, I'll need to look into couch_epi and
> the "mochiweb to cowboy" topic you mention

Just a heads up: mochiweb to cowboy is in pre-discussion stage at best right now ;)

> to make sure that what I'm after is not actually already supported. I've seen
> that in practice, the more mature extensions of the ecosystem,
> "couchdb-lucene" and "GeoCouch" took what looks like a fork approach, so I
> assumed the framework wasn't existing.
>
> I don't mind having to learn Erlang to be honest with you, what I don't want
> as a third-party developer is to have to rebuild my extensions whenever there
> is a new minor release of CouchDb. I don't want to have to maintain a fork.
>
> Perhaps it's worth noting that we mainly develop on Windows where building
> things from source is not as common as on Linux. But I'm sure that
> independently of the platform, in terms of development effort, it is a very
> big plus to have access to the pipeline and some context dynamically, without
> having to rebuild anything. I'm thinking about a plugin architecture like the
> one used for Visual Studio Extensions in terms of usage (like for COM I am
> not trying to push for a particular implementation, just illustrating my
> point)
>
> In terms of use case, the idea is to be able to extend Couchdb as easily as
> by dropping a binary in a folder and activating the extension via Fauxton.
> The author of such an extension wouldn't have to update it for 3.x.x
> releases, but would find it reasonable to be required to do an update when 4.0.0
> or 5.0.0 is released

I had designed and implemented a plugin system that would allow pretty much this for 1.x, but it never made it out of beta. It’s still in 2.0, but installation doesn’t take any cluster-things into account, which a fully-qualified feature definitely should.

https://github.com/apache/couchdb-couch-plugins

Best
Jan
--

>
>
> On 28/09/2016 20:26, Robert Samuel Newson wrote:
>> Hi,
>>
>> We can certainly do better on this front.
>> I will say that the (now
>> venerable) couchdb-lucene project had no problem extending couchdb into
>> full-search capability without source modifications.
>>
>> In 2.0, it's true that we've made things harder to plugin. The couch_epi
>> application is our general answer here, it allows a programmatic override to
>> various places, and we can expand on those hook points easily enough. It
>> does mean writing Erlang code, though.
>>
>> When we talk about switching from mochiweb to cowboy, we gain another
>> possibility to allow extensions through cowboy middleware.
>>
>> To truly make couchdb extensible/pluggable to the degree you seem to be
>> asking for would be more work than that, I think. Under the covers, of
>> course, couchdb is already composed of a large set of independent processes
>> that communicate with each other using messages.
>>
>> The couch_index/couch_mrview split from years back was specifically to allow
>> for new index types to be added smoothly (geocouch was the motivating case,
>> in fact). It's fair to say that it did not pan out, but other approaches
>> could.
>>
>> I think it best not to raise the specter of COM (or CORBA), the details of
>> that distract from the intention here. What you seek is a more composable
>> approach, where you could assemble a system of couchdb components and custom
>> components?
>>
>> It might help at this point to hear some more examples of the extensions you
>> didn't feel able to make.
>>
>> B.
>>
>>> On 28 Sep 2016, at 13:04, Reddy B. <[email protected]> wrote:
>>>
>>> I've been very busy with work for one month only and when I catch up 2.0 is
>>> out and you're even talking about 3.0, congratulations.
>>>
>>> I'd like to contribute to this list, I've not read the source code of
>>> CouchDb yet so I can't be too precise but as the head of development of
>>> several companies, I thought my proposition could be valuable.
>>>
>>> The one big regret I have with CouchDb is the difficulty to extend it.
>>> Namely the necessity to rebuild CouchDb from sources to add things such as
>>> Lucene, or even GeoCouch. To take our example, we would have contributed a
>>> number of extensions to CouchDb already if it wasn't for that. Perhaps it's
>>> just me, but there really is a psychological threshold to pass to get into
>>> building a third-party project, and another one to get into forking it. I
>>> personally don't know if I'll ever get around it, because there's too much
>>> cost and maintenance requirements involved.
>>>
>>> I'm not sure exactly what the limitation is and if this is achievable, but
>>> some sort of language agnostic plugin architecture/extendability pipeline
>>> would be absolutely great and in my opinion can be an interesting priority
>>> for a version 3.0, as it would dramatically help boost the number of
>>> contributions to the CouchDb ecosystem. I'm not sure I have the terminology
>>> right, but it might all come down to making the creation of custom indexes
>>> rebuild-free and language agnostic. I'm thinking of something in the idea
>>> of COM APIs
>>> <https://msdn.microsoft.com/en-us/library/windows/desktop/ms680573%28v=vs.85%29.aspx>.
>>>
>>> If you find the idea interesting, I'd be happy to start getting my hands
>>> dirty and work on it.
>>>
>>>
>>> On 27/09/2016 14:56, Jan Lehnardt wrote:
>>>> Hi all,
>>>>
>>>> apologies in advance, this is going to be a long email.
>>>>
>>>>
>>>> I’ve been holding this back intentionally in order to be able to focus on
>>>> shipping 2.0, but now that that’s out, I feel we should talk about what’s
>>>> next.
>>>>
>>>> This email is separated into areas of work that I think CouchDB could
>>>> improve on, some with very concrete plans, some with rather vague ideas.
>>>> I’ve been collecting these over the past year or <strike>two</strike>five,
>>>> so it’s fairly wide, but I’m sure I’m missing things that other people
>>>> find important, so please add to this list.
>>>>
>>>> After the initial discussion here, I’ll move all of the individual issues
>>>> to JIRA, so we can go down our usual process.
>>>>
>>>> This is basically my wish list, and I’d like this to become everyone’s
>>>> wish list, so please add what I’ve been missing. :) Note, this isn’t a
>>>> free-for-all, only suggest things that you are prepared to see through
>>>> being shipped, from design, implementation to docs.
>>>>
>>>> I don’t have a specific order for these in mind, although I have a rough
>>>> idea of what we should be doing first. Putting all of this on a roadmap is
>>>> going to be a fun future exercise for us, though :)
>>>>
>>>> One last note: this doesn’t include anything on documentation or testing.
>>>> I fully expect to step up our game from here on out. This list is for the
>>>> technical aspects of the project.
>>>>
>>>> * * *
>>>>
>>>> These are the areas of work I’ve roughly come up with that my suggestions
>>>> fit into:
>>>>
>>>> - API
>>>> - Storage
>>>> - Query
>>>> - Replication
>>>> - Cluster
>>>> - Fauxton
>>>> - Releases
>>>> - Performance
>>>> - Internals
>>>> - Builds
>>>> - Features
>>>>
>>>> (I’m not claiming these are any good, but it’s what I’ve got)
>>>>
>>>>
>>>> Let’s go.
>>>>
>>>>
>>>> * * *
>>>>
>>>> # API
>>>>
>>>> ## HTTP2
>>>>
>>>> I think this is an obvious first next step. Our HTTP Layer needs work, our
>>>> existing HTTP server library is not getting HTTP2 support, it’s time to
>>>> attack this head-first. I’m imagining a Cowboy[1]-based HTTP layer that
>>>> calls into a unified internals layer and everything will be rose-golden.
>>>> HTTP2 support for Cowboy is still in progress. Maybe we can help them
>>>> along, or we focus on the internals refactor first and drop Cowboy in
>>>> later (not sure how feasible this approach is, but we’ll figure this out).
>>>>
>>>> In my head, we focus on this and call the result 3.0 in 6-12 months.
>>>> That doesn’t mean we *only* do this, but this will be the focus (more on this
>>>> later).
>>>>
>>>> There are a few fun considerations, mainly of the “avoid Python
>>>> 2/3-chasm”-type. Do we re-implement the 2.0 API with all its
>>>> idiosyncrasies, or do we take the opportunity to clean things up while we
>>>> are at it? If yes, how and how long do we support the then old API? Do we
>>>> manage this via different ports? If yes, how can this be made to work for
>>>> hosting services like Cloudant? Etc. etc.
>>>>
>>>> [1] https://github.com/ninenines/cowboy
>>>>
>>>>
>>>> ## Sub-Document Operations
>>>>
>>>> Currently a doc update needs the whole doc body sent to the server. There
>>>> are some obvious performance improvements possible. For the longest time,
>>>> I wanted to see if we can model sub-document operations via JSON
>>>> Pointers[2]. These would roughly allow pointing to a JSON value via a URL.
>>>>
>>>> For example in this doc:
>>>>
>>>>     {
>>>>       "_id": "123abc",
>>>>       "_rev": "zyx987",
>>>>       "contact": {
>>>>         "name": "",
>>>>         "address": {
>>>>           "street": "Long Street",
>>>>           "nr": 123,
>>>>           "zip": "12345"
>>>>         }
>>>>       }
>>>>     }
>>>>
>>>> An update to the zip code could look like this:
>>>>
>>>>     curl -X POST $SERVER/db/123abc/_jsonpointer/contact/address/zip?rev=zyx987 \
>>>>       -d '54321'
>>>>
>>>> GET/DELETE accordingly. We could shortcut the `_jsonpointer` to just `_`
>>>> if we like the short magic.
>>>>
>>>> JSONPointer can deal with nested objects and lists and works fairly well
>>>> for this type of stuff, and it is rather simple to implement (even I could
>>>> do it:
>>>> https://github.com/janl/erl-jsonpointer/blob/master/src/jsonpointer.erl
>>>> This idea is literally 5 years old, it looks like, no need to use my code
>>>> if there is anything better).
>>>>
>>>> This is just a raw idea, and I’m happy to solve this any other way, if
>>>> somebody has a good approach.
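
A minimal sketch of the sub-document idea above, written in Python rather than Erlang purely for illustration. It resolves and replaces a value addressed by an RFC 6901 JSON Pointer inside an already-decoded document. The function names (`parse_pointer`, `resolve`, `set_value`) are hypothetical and are not taken from CouchDB or from erl-jsonpointer; the `_jsonpointer` endpoint itself is only a proposal in this email, so nothing here talks to a server.

```python
from typing import Any, List


def parse_pointer(pointer: str) -> List[str]:
    """Split an RFC 6901 JSON Pointer into unescaped reference tokens."""
    if pointer == "":
        return []  # the empty pointer addresses the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    # RFC 6901: unescape "~1" to "/" first, then "~0" to "~".
    return [t.replace("~1", "/").replace("~0", "~")
            for t in pointer[1:].split("/")]


def _step(node: Any, token: str) -> Any:
    """Follow one reference token into a dict (object) or list (array)."""
    if isinstance(node, list):
        # Array indices are ASCII digits without leading zeros.
        valid = token.isascii() and token.isdigit()
        if not valid or (len(token) > 1 and token[0] == "0"):
            raise ValueError(f"invalid array index: {token!r}")
        return node[int(token)]  # IndexError if out of range
    if isinstance(node, dict):
        return node[token]  # KeyError if the member is missing
    raise KeyError(token)  # cannot descend into a scalar


def resolve(doc: Any, pointer: str) -> Any:
    """Return the value that `pointer` addresses inside `doc`."""
    node = doc
    for token in parse_pointer(pointer):
        node = _step(node, token)
    return node


def set_value(doc: Any, pointer: str, value: Any) -> None:
    """Replace an existing value in place (hypothetical helper)."""
    tokens = parse_pointer(pointer)
    if not tokens:
        raise ValueError("cannot replace the whole document in place")
    parent = doc
    for token in tokens[:-1]:
        parent = _step(parent, token)
    last = tokens[-1]
    _step(parent, last)  # raises if the target does not exist yet
    if isinstance(parent, list):
        parent[int(last)] = value
    else:
        parent[last] = value


if __name__ == "__main__":
    doc = {
        "_id": "123abc",
        "_rev": "zyx987",
        "contact": {
            "name": "",
            "address": {"street": "Long Street", "nr": 123, "zip": "12345"},
        },
    }
    set_value(doc, "/contact/address/zip", "54321")
    print(resolve(doc, "/contact/address/zip"))
```

In a real endpoint the pointer would come from the URL path after `_jsonpointer`, and revision checking (`?rev=`) would still have to happen before the write; both are left out of this sketch.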
>>>>
>>>> [2] https://tools.ietf.org/html/rfc6901
>>>>
>>>>
>>>> ## HTTP PATCH / JSON Diff
>>>>
>>>> Another stab at a similar problem is HTTP PATCH with JSON Diff, but with
>>>> the inherent problems of JSON normalisation, I’m leaning towards the
>>>> JSONPointer variant as simpler, but I’d be open for this as well, if
>>>> someone comes up with a good approach.
>>>>
>>>>
>>>> ## GraphQL[3]
>>>>
>>>> It’s rather new, but getting good traction[4]. This would be a nice
>>>> addition to our API. Somebody might already be hacking on this ;)
>>>>
>>>> [3]: http://graphql.org
>>>> [4]: http://githubengineering.com/the-github-graphql-api/
>>>>
>>>>
>>>> ## Mango for Document Validation
>>>>
>>>> The only place where we absolutely require writing JS is
>>>> validate_doc_update functions. Some security behaviour can only be
>>>> enforced there. With their inherent performance problems, I’d like to get
>>>> doc validations out of the path of the query server and would love to find
>>>> a way to validate document updates through Mango.
>>>>
>>>>
>>>> ## Redesign Security System
>>>>
>>>> Our security system has slowly grown and is not coherently designed. We should
>>>> start over. I have many ideas and opinions, but they are out of scope for
>>>> this. I think everybody here agrees that we can do better. This *very
>>>> likely* will *not* include per-document ACLs as per the often stated
>>>> issues with that approach in our data model.
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Replication
>>>>
>>>> This is our flagship feature of course, and there are a few things we can
>>>> do better.
>>>>
>>>>
>>>> ## Mobile-optimised extension or new version of the protocol
>>>>
>>>> The original protocol design didn’t take mobile devices into account and
>>>> through PouchDB et al. we are now learning that there are a number of
>>>> downsides to our protocol.
>>>> We’ve helped a lot with introducing
>>>> _bulk_get/_revs, but that’s more a bandaid than a considered strategy ;)
>>>>
>>>> That new version could also be HTTP2-only, to take advantage of the new
>>>> connection semantics there.
>>>>
>>>>
>>>> ## Easy way to skip deletes on sync
>>>>
>>>> This one is self-explanatory, mobile clients usually don’t need to sync
>>>> deletes from a year ago first. Mango filters might already get us there,
>>>> maybe we can do better.
>>>>
>>>>
>>>> ## Sync a rolling subset
>>>>
>>>> Say you always want to keep the last 90 days of email on a mobile device
>>>> with optionally back-loading older documents on user-request. It is
>>>> something I could see getting a lot of traction.
>>>>
>>>> Today, this can be built on 1.x with clever use of _purge, but that’s
>>>> hardly a good experience. I don’t know if it can be done in a cluster.
>>>>
>>>>
>>>> ## Selective Sync
>>>>
>>>> There might be other criteria than “last 90 days”, so the more general
>>>> solution to this problem class would be arbitrary (e.g. client-directed)
>>>> selective sync, but this might be really hard as opposed to just very hard
>>>> of the “last 90 days” one, so happy to punt on this first. But filters are
>>>> generally not the answer, especially with large data sets. Maybe proper
>>>> sync from views _changes is the answer.
>>>>
>>>>
>>>> ## A _db_updates powered _replicator DB
>>>>
>>>> Running thousands+ of replications on a server is not really resource
>>>> friendly today, we should teach the replicator to only run replication on
>>>> active databases via _db_updates. Somebody might already be looking into
>>>> this one.
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Storage
>>>>
>>>>
>>>> ## Pluggable Storage Engines
>>>>
>>>> Paul Davis already showed some work on allowing multiple different storage
>>>> backends. I’d like to see this land.
>>>>
>>>> ## Different Storage Backends
>>>>
>>>> These don’t all have to be supported by the main project, but I’d really
>>>> like to see some experimentation with different backends like
>>>> LevelDB[5]/RocksDB[6], InnoDB[7], SQLite[8], or a native-Erlang one that is
>>>> optimised for space usage and not performance (I don’t want to budge on
>>>> safety). Similarly, it’d be fun to see if there is a compression format
>>>> that we can use as a storage backend directly, so we get full-DB
>>>> compression as opposed to just per-doc compression.
>>>>
>>>> [5]: http://leveldb.org
>>>> [6]: http://rocksdb.org
>>>> [7]: https://en.wikipedia.org/wiki/InnoDB
>>>> [8]: https://www.sqlite.org
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Query
>>>>
>>>> ## Teach Mango JOINs and result sorting
>>>>
>>>> It’s the natural path for query languages. We should make these happen.
>>>> Once we have the basics, we might even be able to find a way to compile
>>>> basic SQL into Mango, it’s going to be glorious :)
>>>>
>>>>
>>>> ## “No-JavaScript”-mode
>>>>
>>>> I’ve hinted at this above, but I’d really like a way for users to use
>>>> CouchDB productively without having to write a line of JavaScript. My main
>>>> motivation is the poor performance characteristics of the Query Server
>>>> (hello CGI[9]?). But even with one that is improved, it will always be faster
>>>> to do any, say filtering or validation operations in native Erlang. I
>>>> don’t know if we can expand Mango to cover all this, and I’m not really
>>>> concerned about the specifics, as long as we get there.
>>>>
>>>> Of course, for pro-users, the JS-variant will still be around.
>>>>
>>>> [9]: https://en.wikipedia.org/wiki/Common_Gateway_Interface
>>>>
>>>>
>>>> ## Query Server V2
>>>>
>>>> We need to revamp the Query Server. It is hardcoded to an out-of-date
>>>> version of SpiderMonkey and we are stuck with C-bindings that barely
>>>> anyone dares to look at, let alone iterate on.
>>>>
>>>> I believe the way forward is re-vamping the query server protocol to use
>>>> streaming IO instead of blocking batches like we do now, and use JS-native
>>>> implementation of the JS-side instead of C-bindings.
>>>>
>>>> I’m partial to doing this straight in Node, because there is a ton of
>>>> support for things we need already, and I believe we’ve solved the
>>>> isolation issues required for secure MapReduce, but I’m happy to use any
>>>> other thing as well, if it helps.
>>>>
>>>> Other benefits would be support for emerging JS features that devs will
>>>> want to use.
>>>>
>>>> And we can have two modes: standalone QS like now, and embedded QS where,
>>>> say, V8 is compiled into the Erlang VM. Not everybody will want to run
>>>> this, but it’ll be neat for those who do.
>>>>
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Cluster
>>>>
>>>> ## Rebalancing
>>>>
>>>> With this we will be able to grow clusters one by one instead of hitting a
>>>> wall when eventually each shard lives on a single machine. E.g. when you
>>>> add a node to the cluster, all other nodes share 1/Nth of their data with
>>>> the new node, and everything can keep going. Same for removing a node and
>>>> shrinking the cluster.
>>>>
>>>> Couchbase has this and it is really nice.
>>>>
>>>>
>>>> ## Setup
>>>>
>>>> Even without rebalancing, we need a nice Fauxton UI to manage the cluster,
>>>> so far we only have a simple setup procedure (which is great don’t get me
>>>> wrong), but users will want to do more elaborate cluster management and we
>>>> should make that easy with a slick UI.
>>>>
>>>>
>>>> ## Cluster-Aware Clients
>>>>
>>>> This might end up being not a good idea, but I’d like some experimentation
>>>> here. Say you’d have a CouchDB client that could be hooked into the
>>>> cluster topology so it’d know which nodes to query for which data, then we
>>>> can save a proxy-hop, and build clients that have lower-latency access to
>>>> CouchDB.
>>>> Again, this is something that Couchbase does and I think is worth
>>>> exploring.
>>>>
>>>>
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Fauxton
>>>>
>>>> Fauxton is great, but it could be better too, I think. I’m mostly
>>>> concerned about number of clicks/taps required for more specialised
>>>> actions (like setting the group_level of a reduce query, it’s like 15 or
>>>> so). More cluster info would also be nice, and maybe a specialised
>>>> dashboard for db-per-user setups.
>>>>
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Releases
>>>>
>>>>
>>>> ## Six-Week Release Trains
>>>>
>>>> We need to get back to frequent releases and I propose to go back to our
>>>> six-week-release train plans from three years ago. Whatever lands within a
>>>> release train time frame goes out. The nature of the change dictates the
>>>> version number increment as per semver, and we just ship a new version
>>>> every six weeks, even if it only includes a single bug fix. We should
>>>> automate most of this infrastructure, so actual releases are cheap. We are
>>>> reasonably close with this, but we need some more folks to step up on
>>>> using and maintaining our CI systems.
>>>>
>>>>
>>>> ## One major feature per major version
>>>>
>>>> I also propose to keep the scope of future major versions small, so we
>>>> don’t have to wait another 3-5 years for 3.0. In particular, I think we
>>>> should focus on a single major feature per major version and get that
>>>> shipped within 6-12 months tops. If anything needs more time, it needs to
>>>> be broken up. Of course we continue to add features and fix things while
>>>> this happens, but as a project, there is *one* major feature we push. For
>>>> example, for 3.0 I see our push be behind HTTP2 support. There is a lot of
>>>> subsequent work required to make that happen, so it’ll be a worthwhile
>>>> 3.0, but we can ship it in 6-12 months (hopefully).
>>>>
>>>> Best case scenario, we have CouchDB 4.0 coming out 12 months from now with
>>>> two new major features.
>>>> That would be amazing.
>>>>
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Performance
>>>>
>>>> ## Perf Team
>>>>
>>>> We need a team to comprehensively look at CouchDB performance. There is a
>>>> lot of low-hanging fruit like Robert Kowalski showed a while back, we
>>>> should get back into this. I’m mostly inspired by SQLite who’ve done a
>>>> release a while back that only focussed on 1-2% performance improvements,
>>>> but got like 20-30 of those and made the thing a lot faster across the
>>>> board. I can’t remember where I read about this, but I’ll update this once
>>>> I find the link.
>>>>
>>>>
>>>> ## Benchmark Suite
>>>>
>>>> We need a benchmark suite that tests a variety of different work loads.
>>>> The goal here is to run different versions of CouchDB against the same
>>>> suite on the same hardware, to see where we are going. I’m imagining a
>>>> http://arewefastyet.com style dashboard where we can track this, and even
>>>> run this on Pull Requests and not allow them if they significantly impact
>>>> performance.
>>>>
>>>>
>>>> ## Synthetic Load Suite
>>>>
>>>> This one is for end users. I’d like to be able to say: My app produces
>>>> mostly 10-20kb-sized docs, but millions of those in a single database, or
>>>> across 1000s of databases, with these views etc. and then run this on
>>>> target hardware so I’d know, e.g. how many nodes I need for a cluster with
>>>> my estimated workload. I know this can only be done in approximation, but
>>>> I think this could make a big difference in CouchDB adoption and feed back
>>>> into Perf Team mentioned above.
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Internals
>>>>
>>>> ## Consolidate Repositories
>>>>
>>>> With 2.0 we started to experiment with radically small modules for our
>>>> components and I think we’ve come to the conclusion that some
>>>> consolidation is better for us going forward. Obvious candidates for
>>>> separate repos are docs, Fauxton etc.
>>>> but also some of the Erlang modules
>>>> that other projects reasonably would use.
>>>>
>>>>
>>>> ## Elixir
>>>>
>>>> I’d like it very much if we elevate Elixir as a prime target language for
>>>> writing CouchDB internals. I believe this would get us an influx of new
>>>> developers that we badly need to get all the things I’m listing here done.
>>>> Somebody might be looking into the technical aspects of this already, but
>>>> we need to decide as a project if we are okay with that.
>>>>
>>>>
>>>> ## GitHub Issues
>>>>
>>>> I hope we can transition to GitHub Issues soon.
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Builds
>>>>
>>>> I’d like automated builds for source, Docker et al., rpm, deb, brew,
>>>> ports, Mac Binary, etc with proper release channels for people to
>>>> subscribe to, all powered by CI for nightly builds, so people can test
>>>> in-development versions easily.
>>>>
>>>> I’d also like builds that include popular community plugins like Geo or
>>>> Fulltext Search.
>>>>
>>>>
>>>>
>>>> * * *
>>>>
>>>>
>>>> # Features
>>>>
>>>> ## Better Support for db-per-user
>>>>
>>>> I don’t know what this will look like, but this is a pattern, and we need
>>>> to support it better.
>>>>
>>>> One approach could be “virtual dbs” that are backed by a single database,
>>>> but that’s usually at odds with views, so we could make this an XOR and
>>>> disable views on these dbs. Since this usually powers client-heavy apps,
>>>> querying usually happens there anyway.
>>>>
>>>> Another approach would be better / easier cross-db aggregation or
>>>> querying. There are a few approaches, but nothing really slick.
>>>>
>>>>
>>>> ## Schema Extraction
>>>>
>>>> I have half an (old) patch that extracts top level fields from a document
>>>> and stores them with a hash in an “attachment” to the database header. So
>>>> we only end up storing doc values and the schema hash.
>>>> First of all this
>>>> trades storage for CPU time (I haven’t measured anything yet), but more
>>>> interestingly, we could use that schema data to do smart things like
>>>> auto-generating a validation function / mango expression based on the data
>>>> that is already in the database. And other fun things like easier schema
>>>> migration operations that are native in CouchDB and thus a lot faster than
>>>> external ones. For the curious ones, I’ve got the idea from V8’s property
>>>> access optimisation strategy[10].
>>>>
>>>> [10]: https://github.com/v8/v8/wiki/Design%20Elements#fast-property-access
>>>>
>>>> * * *
>>>>
>>>> Alright, that’s it for now. Can’t wait for your feedback!
>>>>
>>>> Best
>>>> Jan
>

--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
