Re: [DISCUSS] Rebase CouchDB on top of FoundationDB

Jan Lehnardt Thu, 24 Jan 2019 04:54:50 -0800

> I do trust when Jan says "it looks like we can pull this
> off reasonably easy enough by taking on a few growing pains"



I’m honoured :D — so far my commitment goes as far as: it is totally worth
discussing the proposal to a point where all open questions have an answer.

Whether those answers are ones we as a project like, or think are feasible to 
do given our constraints, is a separate discussion. I’m cautiously optimistic, 
but that’s about it.

Best
Jan
—


> On 24. Jan 2019, at 12:33, Michael Fair <mich...@daclubhouse.net> wrote:
> 
> Wow, I think I get it, mostly; I still haven't read the FDB docs yet, but I
> grok the replies to Will's and my email, and it sounds like FoundationDB
> has done some really good underlying work where CouchDB could, in a sense,
> become an advocate project for FDB's utility.
> 
> TLDR; +1 from me. :-)
> Is this actually worthy of a 3.0 moniker; it seems like it could be
> (breaking changes and dropping 1.X compatibility)?
> 
> 
> Some general, higher level, thoughts (that probably mimic what you guys
> have already been thinking):
> 1) I believe CouchDB, the software project, needs a growth path for adding
> features somewhat organically.  I haven't felt this has been the case
> historically when it comes to the wire and data storage aspects of the
> project.  There have been a few ideas I've wanted to experiment with in the
> replication protocol.
> 
> For example:
> * auto-resolving/merging conflict branches when we can tell/detect that the
> heads of two conflicting branches contain an identical document;
> * adding some version of a binary JSON encoding to reduce network
> utilization;
> * creating some kind of a JSON diff ability to transmit changes between
> document revisions (that is also binary-able);
> * experimenting with encrypted and private data stores to create a
> decentralized vault where secret data could be shared with other parties
> via CouchDB without revealing the secret data while in transit;
> * "object type" based HTML template forms that could be directly "filled
> out" by documents / "modified" by browsers
> 
> Overall it sounds to me like the FoundationDB changes/advantages Jan and
> Robert described so far collectively point in a direction that would,
> generally speaking, make my life easier in approaching at least some of
> these ideas.
> 
> 
> 2) I like the idea of "removing code" from Couch where it makes sense like
> this.  The Apache CouchDB project to me has always, in a sense, been more
> representative of the replication wire protocol and replication semantics
> than the Erlang software project.  I really enjoy the idea that other
> projects can incorporate their own "Couch Compatible Replication Layer" and
> use the CouchDB software project as their de facto test software.
> "Outsourcing" the KV Store work to a project like FDB where there are other
> people who enjoy focusing on that specific aspect of the problem leaves the
> Couch folks free to focus on higher level features; which I believe is a
> wonderful thing and exactly the right direction the whole project should
> generally be going.  I think of Couch as a very "end user facing" project,
> and as such, I think it benefits more than it risks (and I understand the
> risks of creating outside-party dependencies) by building this kind of
> outsourced dependency on a project that demonstrates: (1) technical
> competence, (2) a fair bit of maturity, (3) decent docs to help ourselves
> navigate "their world", and (4) a willingness to be responsive to CouchDB
> project requests/feedback/contributions.  It sounds like the FDB project
> scores well on all four of those points.  Having community members with
> feet/experience in both projects is a huge help/bonus.
> 
> 
> 3) This also seems to enable some ideas that I've long wanted to try but
> never thought I could do because of CouchDB's document storage design:
> indexing/exploring the db using the Neo4j graph database and syncing docs using
> the IPLD semantics within the IPFS project (very personal to me I know, but
> still, it's nice to see the ideas look more promising to me).  I'm
> personally a really big fan of the "multiple master copies at distributed
> locations" aspect to CouchDB; more so than the "single
> distributed/sharded/parallel multiserver database" aspect.  I understand
> there are more immediate, lucrative applications to the local multiserver,
> larger database aspect; so I'm excited to see that work done too. It's
> simply not the aspect of the project that really catches my interests.
> 
> 
> 
> Thanks very much Jan/Robert for hearing what I had to say and giving great
> and meaningful replies!
> 
> Jan, I really appreciate you commenting that you understood my concerns
> about taking seriously the need to really incorporate the technology into
> the DNA of the community and that's what you expect to see successfully
> happen; and Robert, likewise, for adding that the existing FDB community
> could very well be interested in Couch as a "front end test project" to
> give practical application and meaning to some of their work.
> 
> While I'm not going to go so far as to say I can personally vouch for the
> proposal's success; I do trust when Jan says "it looks like we can pull this
> off reasonably easy enough by taking on a few growing pains" that it's
> true; and coupled with the clear amount of behind-the-scenes forethought
> that went into it; I really like it.
> 
> 
> Thanks!
> Mike
> 
> 
> On Thu, Jan 24, 2019 at 2:20 AM Robert Samuel Newson <rnew...@apache.org>
> wrote:
> 
>> Hi,
>> 
>> Thank you for the in-depth response, that’s exactly what the PMC is
>> looking for.
>> 
>> You are comprehending the nature and magnitude of the change correctly
>> here, where you suggest we could “just” write a new CouchDB Layer on top of
>> FoundationDB and achieve a similar effect. However, the nature of software
>> and software development really speaks against doing it that way, in my
>> opinion. In 2.0 we introduced an abstraction between the HTTP processing
>> layer and the lower plumbing of b-trees and file I/O with the “fabric”
>> application. This was essential to introduce clustering but it was a
>> significant architectural improvement in its own right. By reimplementing
>> below that line we can be more confident that we have preserved all the
>> necessary parts of the CouchDB API and experience. Additionally, separate
>> applications like the replicator and job scheduler can remain as they are.
>> A lot of the existing code will remain as-is, or have minor changes or
>> cleanup (the “local” mode for replication, unreachable since 2.0, can
>> finally be excised, for example).
>> 
>> To your other point, I remember the difficulty I first had when looking at
>> CouchDB. It’s in Erlang, which I’d not used before, and there is a lot of
>> subtle and tricky code at the lower tiers (see couch_key_tree.erl or
>> couch_btree.erl). By using FoundationDB for that instead I hope we
>> _increase_ the comprehensibility of CouchDB, as what remains will be its
>> essential nature and not the important but ancillary plumbing below. The
>> increased public development activity on CouchDB, the size of the ambition
>> here, and some cross-pollinating interest from those who know or are
>> interested in FoundationDB should, I hope, bring more active developers of
>> all levels of experience and interest to our project.
>> 
>> B.
>> 
>>> On 23 Jan 2019, at 23:27, Michael Fair <mich...@daclubhouse.net> wrote:
>>> 
>>> As someone who isn't as directly involved in the release-to-release
>>> development, would a move like this make it easier or harder for new/casual
>>> community members to get up to speed/understand what's going on?
>>> 
>>> As projects grow and mature, the introductory learning curve tends to get
>>> steeper, making it harder for people who didn't "grow up with the project"
>>> to grok the project as a whole thing.  Not complaining, just identifying.
>>> 
>>> Is this proposal suggesting something more akin to a storage layer
>>> separation (making it somewhat easier to identify the separate component
>>> layers and experiment with different backends) or more like a storage
>>> technology change (where any experimenter would first have to understand
>>> how FDB semantics are different from File I/O semantics)?
>>> 
>>> All in all it sounds like a promising proposal.
>>> My first thought was something like "Hmm, is this different than simply
>>> adding a 'Couch Replication Protocol' module to FoundationDB? Probably, or
>>> they wouldn't be proposing it this way"
>>> 
>>> Followed quickly by, "Okay, looks like I'll likely need to start learning
>>> FoundationDB now too if I really want to understand CouchDB's
>>> capabilities.  I've not really heard much/looked at it before..."
>>> 
>>> I don't think a new learning curve should dissuade people from adopting it,
>>> but as I haven't looked at the educational materials available, I can't
>>> speak to the level of "ownership" the general community would be able to
>>> keep.
>>> 
>>> My experience is, generally speaking, people simply avoid aspects of a
>>> project they don't feel competent in.  Leaving that work to those with
>>> stronger opinions/convictions/interest. And that the easier it is to
>>> independently "get up to speed" on that aspect of the project (reading a
>>> blog(s)/watching a video(s)/tracing code) the more likely an interested
>>> party is to contribute there.
>>> 
>>> It'd be great to find out that a consequence of this move makes it easier
>>> for interested people, still unfamiliar with CouchDB's internals, to get
>>> more involved because there were some great and easily accessible teaching
>>> materials...
>>> 
>>> This concept obviously isn't unique to this FDB proposal; nor is it
>>> advocating for or against; I guess it's just expressing a hope that the
>>> impact is made to also help those who would like to get started
>>> contributing to CouchDB in meaningful ways instead of them getting a new
>>> and more complicated third party tech dependency to go learn as well.
>>> 
>>> Mike
>>> 
>>> PS While I assume there's likely very clear answers, does this differ
>>> significantly from the idea of giving FoundationDB a Couch compatible web
>>> API interface?  Like instead of making FoundationDB "the storage backend"
>>> for Couch, why not add a Couch compatible web interface front end to
>>> FoundationDB?  Is there a lot of useful Couch code in between those two
>>> things?
>>> 
>>> 
>>> On Wed, Jan 23, 2019 at 12:20 PM Joan Touzet <woh...@apache.org> wrote:
>>> 
>>>> Hi everyone,
>>>> 
>>>> As Jan mentions, the PMC has had a couple of weeks to prepare on this.
>>>> 
>>>> As a non-IBMer (though an ex-IBM-er and ex-Cloudant-er), I've had my
>>>> Apache PMC hat on the entire time, considering all of the things
>>>> that Jan mentions and more. My primary concern has been ensuring that,
>>>> should this go forward, what happens occurs in the project's best
>>>> interest.
>>>> 
>>>> During the analysis process I came up with 8 serious topics that we
>>>> need to sort out:
>>>> 
>>>> * RFC process - how major changes are proposed/designed/accepted,
>>>>               see new GitHub template for a preview on this
>>>> 
>>>> * Bylaws review - namely, should we insist on +1s from outside
>>>>                 your company for big things? Plus RFC/deprecations.
>>>> 
>>>> * Roadmap - we have a roadmap from ~24 months ago that represented
>>>>           our goals for CouchDB 2.x and 3.x. What happens to it?
>>>>           https://s.apache.org/couch2xroadmap
>>>> 
>>>> * Onboarding - better mentoring in The Apache Way and The CouchDB
>>>>              Way for new members (from IBM and elsewhere)
>>>> 
>>>> * (Re-)Branding - how do we differentiate between "CouchDB Classic"
>>>>                 and "New CouchDB" in a succinct and clear way?
>>>> 
>>>> * FoundationDB - all the non-technical aspects. Review of _their_
>>>>                project governance, cross-project pollination, us
>>>>                learning the core and pros/cons, identifying who
>>>>                will actually learn that code base, and operational
>>>>                considerations. Also: keeping this knowledge public
>>>>                and not just "inside IBM's dev/ops teams".
>>>> 
>>>> * Proj. Mgmt. - Obviously IBM will have a PM involved. We should too.
>>>>               Reviewing process/procedure and ensuring a smooth
>>>>               collaboration is critical. IBM doesn't get to just
>>>>               throw code over the wall at us. Similarly, should we
>>>>               choose to work on proposed features, or stuff from
>>>>               the roadmap, we need to be able to cooperate. No
>>>>               cookie licking allowed![*]
>>>> 
>>>> * Tech deep dives - this will actually be many, many threads I expect,
>>>>                   including everyone's favourite on release mgmt :P
>>>> 
>>>> New threads will be started on these topics by PMC members over the
>>>> coming days (but not all at once, so everyone has time to reflect and
>>>> respond.)
>>>> 
>>>> My initial take on the proposal: it's GOOD that we're finally
>>>> addressing some of the problems that 2.x brought to the table, and if
>>>> this is the best way to do so, then so be it. I want to know more
>>>> about the technical details, and I want to see a more formal RFC before
>>>> voting on it, though.
>>>> 
>>>> -Joan 'And now for something completely different...' Touzet
>>>> 
>>>> [*] http://communitymgt.wikia.com/wiki/Cookie_Licking
>>>> 
>>>> 
>>>> ----- Original Message -----
>>>>> From: "Jan Lehnardt" <j...@apache.org>
>>>>> To: "CouchDB Developers" <dev@couchdb.apache.org>
>>>>> Sent: Wednesday, January 23, 2019 8:33:30 AM
>>>>> Subject: Re: [DISCUSS] Rebase CouchDB on top of FoundationDB
>>>>> 
>>>>> Hi Bob,
>>>>> 
>>>>> this is all very exciting!
>>>>> 
>>>>> First up, full disclosure, the CouchDB PMC has had about two weeks to
>>>>> think about this already, so if any of the following doesn’t sound
>>>>> like a knee-jerk reaction, that’s why.
>>>>> 
>>>>> I’m personally tentatively optimistic about this proposal and I’m
>>>>> willing to work through all open questions from governance,
>>>>> contribution management to the technical bits to see if we as the
>>>>> CouchDB project arrive at a point where we are comfortable going
>>>>> down this path.
>>>>> 
>>>>> The PMC has already identified a set of discussion areas for this
>>>>> dev@ mailing list to go through before any definite decision can be
>>>>> made. Separate emails for those discussions are going to be posted
>>>>> on this list shortly, so I won’t go into further detail here.
>>>>> 
>>>>> If anyone sees a need for discussion beyond the threads that will
>>>>> appear here, please speak up at your earliest convenience. This
>>>>> proposal would mean a big step for our project, and we must make
>>>>> sure to hear all voices.
>>>>> 
>>>>> Once we’ve gone through all this, the resulting answers to all the
>>>>> open questions coming up will end up in a consensus finding process
>>>>> on this mailing list, which will signify the final project decision.
>>>>> 
>>>>> * * *
>>>>> 
>>>>> That said, I’d like to highlight one of these topics: IBM/Cloudant’s
>>>>> contributions going forward.
>>>>> 
>>>>> Looking at how 2.0 came to be, the contributions were mostly taken on
>>>>> good faith (and legal review), and from the trust Cloudant built up
>>>>> operating a large number of large instances of clusters of what
>>>>> would eventually become CouchDB 2.0. It has clearly paid off for
>>>>> CouchDB and our current level of success wouldn’t exist without
>>>>> IBM/Cloudant.
>>>>> 
>>>>> However, some of the ways we work with the IBM team leave things to
>>>>> be desired. Specifically, the Apache CouchDB community is frequently
>>>>> not involved in design discussions around new features. Those happen
>>>>> inside IBM and we “only” get a PR that then goes through the regular
>>>>> review process. Again, this has served us well, but we can do even
>>>>> better, so I’d like to take the opportunity of this larger proposal
>>>>> to suggest we actually do better. As promised, a more detailed
>>>>> thread about this is going to come up, and it’ll be the right place
>>>>> to go through the minutiae of this.
>>>>> 
>>>>> With this structural change, I believe we are in a great position to
>>>>> work through the details of this proposal and the subsequent design
>>>>> and engineering steps.
>>>>> 
>>>>> * * *
>>>>> 
>>>>> Finally, I want to reiterate Bob’s point: while this proposal is
>>>>> largely driven by IBM, IBM has no power to unilaterally force the
>>>>> CouchDB project to accept this proposal and they have already
>>>>> signalled and worked towards making this a mutually beneficial
>>>>> endeavour. The CouchDB project has different objectives from IBM and
>>>>> it is up to us to come up with a proposal that satisfies all of our
>>>>> objectives as well as IBM’s, should this motion pass.
>>>>> 
>>>>> Best
>>>>> Jan
>>>>> —
>>>>> 
>>>>> 
>>>>>> On 23. Jan 2019, at 11:00, Robert Samuel Newson
>>>>>> <rnew...@apache.org> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> CouchDB 2.0 introduced clustering; the ability to scale a single
>>>>>> database across multiple nodes, increasing both the maximum size
>>>>>> of a database and adding native fault-tolerance. This welcome and
>>>>>> considerable step forward was not without its trade-offs. In the
>>>>>> years since 2.0 was released, users have frequently encountered the
>>>>>> following issues as a direct consequence of the 2.0 clustering
>>>>>> approach:
>>>>>> 
>>>>>> 1. Conflict revisions can be created on normal concurrent updates
>>>>>> issued to a single database, since each replica of a database
>>>>>> shard independently chooses whether to accept a given update, and
>>>>>> all replicas will eventually propagate updates that any one of
>>>>>> them has chosen to accept.
>>>>>> 2. Secondary indexes ("views") do not scale the same way as
>>>>>> document lookups, as they are sharded by doc id, not emitted view
>>>>>> key (thus forcing a consultation of all shard ranges for each
>>>>>> query).
>>>>>> 3. The changes feed is no longer totally ordered and, worse, could
>>>>>> replay earlier changes in the event of a node failure (even a
>>>>>> temporary one).
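
A minimal sketch of issue 2 above, not taken from the proposal: when view rows are partitioned by a hash of the doc id, a key-range query has to visit every shard, whereas a single index ordered by the emitted key answers it with one range read. The shard count, the hash choice, and all function names are assumptions made for illustration; this is not CouchDB's actual sharding code.

```python
import hashlib
from bisect import bisect_left

NUM_SHARDS = 4  # hypothetical shard count, chosen only for the example


def shard_for(doc_id: str) -> int:
    """Pick a shard from a hash of the document id (illustrative only)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


def build_sharded_view(docs):
    """docs maps doc_id -> emitted view key; rows land on the doc's shard."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for doc_id, view_key in docs.items():
        shards[shard_for(doc_id)].append((view_key, doc_id))
    for rows in shards:
        rows.sort()
    return shards


def query_sharded(shards, start, end):
    """Rows with start <= key < end. Any shard may hold a matching key,
    so all of them are consulted and the results merged."""
    consulted = 0
    rows = []
    for shard in shards:
        consulted += 1
        rows.extend(r for r in shard if start <= r[0] < end)
    return sorted(rows), consulted


def query_ordered(index, start, end):
    """Same query against one index sorted by view key: one range read."""
    lo = bisect_left(index, (start,))
    hi = bisect_left(index, (end,))
    return index[lo:hi]
```

In this model the second value returned by `query_sharded` always equals `NUM_SHARDS`, however few rows match, while `query_ordered` touches only the matching slice.
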
>>>>>> 
>>>>>> The idea is to use FoundationDB as the new CouchDB foundational
>>>>>> layer, letting it take care of data storage and placement. An
>>>>>> introduction to FoundationDB would take up too much space here so
>>>>>> I will summarise it as a highly scalable ordered key-value store
>>>>>> with transactional semantics that provides strong consistency, scaling
>>>>>> from a single node to many. It is licensed under the ASLv2 but is
>>>>>> not an Apache project.
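
To make "ordered key-value store with transactional semantics" concrete, here is a toy in-memory model, offered only as a sketch under stated assumptions. It is not FoundationDB's API or implementation; `ToyStore`, `ToyTransaction` and `ConflictError` are hypothetical names. It shows the general idea of optimistic transactions: two concurrent writers that read the same key cannot both commit, so the loser must retry rather than create a second branch.

```python
class ConflictError(Exception):
    """Raised when a key read by the transaction changed before commit."""


class ToyStore:
    def __init__(self):
        self.data = {}      # key -> value
        self.versions = {}  # key -> commit counter of the last write
        self.counter = 0    # increases by one per successful commit

    def begin(self):
        return ToyTransaction(self)

    def range(self, start, end):
        """Ordered scan of committed keys with start <= key < end."""
        return [(k, self.data[k]) for k in sorted(self.data) if start <= k < end]


class ToyTransaction:
    def __init__(self, store):
        self.store = store
        self.reads = {}   # key -> version observed when first read
        self.writes = {}  # key -> value buffered until commit

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        self.reads.setdefault(key, self.store.versions.get(key, 0))
        return self.store.data.get(key)

    def set(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort if anything this transaction read has been rewritten since.
        for key, seen in self.reads.items():
            if self.store.versions.get(key, 0) != seen:
                raise ConflictError(key)
        self.store.counter += 1
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.versions[key] = self.store.counter
```

A caller would typically wrap the read-modify-write in a retry loop that starts a fresh transaction whenever `commit()` raises `ConflictError`.
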
>>>>>> 
>>>>>> By using FoundationDB we can solve all three of the problems listed
>>>>>> above and deliver semantics much closer to CouchDB 1.x's behaviour
>>>>>> while improving upon the scalability advantages that 2.0
>>>>>> introduced. The essential character of CouchDB would be preserved
>>>>>> (MVCC for documents, replication between CouchDB databases) but
>>>>>> the underlying plumbing would change significantly. In addition,
>>>>>> this new foundation will allow us to add long wished-for features
>>>>>> more easily. For example, multi-document transactions become
>>>>>> possible, as does efficient field-level reading and writing. A
>>>>>> further thought is the ability to update views transactionally
>>>>>> with the database update.
>>>>>> 
>>>>>> For those familiar with the CouchDB 2.0 architecture, the proposal
>>>>>> is, in effect, to change all the functions in fabric.erl so that
>>>>>> they work against a (possibly remote) FoundationDB cluster instead
>>>>>> of the current implementation of calling into the original CouchDB
>>>>>> 1.x code (couch_btree, couch_file, etc).
>>>>>> 
>>>>>> This is a large change and, for full disclosure, the IBM Cloudant
>>>>>> team are proposing it. We have done our due diligence in
>>>>>> investigating FoundationDB as well as detailed investigation into
>>>>>> how CouchDB semantics would be built on top of FoundationDB. Any
>>>>>> and all decisions on that must take place here on the CouchDB
>>>>>> developer mailing list, of course, but we are confident that this
>>>>>> is feasible.
>>>>>> During those investigations we have identified a small number of
>>>>>> CouchDB features that we do not yet see a way to do on
>>>>>> FoundationDB, the main one being custom (JavaScript) reduces. This
>>>>>> is a direct consequence of no longer rolling our own persistence
>>>>>> layer (couch_btree and friends) and would likely apply to any
>>>>>> alternative technology.
>>>>>> 
>>>>>> I think this would be a great advance for CouchDB, preserving what
>>>>>> makes CouchDB special but taking advantage of the superbly
>>>>>> engineered FoundationDB software at the bottom of the stack.
>>>>>> 
>>>>>> Regards,
>>>>>> Robert Newson
>>>>> 
>>>>> --
>>>>> Professional Support for Apache CouchDB:
>>>>> https://neighbourhood.ie/couchdb-support/
>>>>> 
>>>>> 
>>>> 
>> 
>> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
