Re: [DISCUSS] Rebase CouchDB on top of FoundationDB

Jan Lehnardt Thu, 24 Jan 2019 01:44:48 -0800

Hi Mike,

these are excellent points, thank you for taking the time to chime in.


The current state of the CouchDB source code is that it is already rather 
nicely layered, and concerns are separated rather decently. That said, some of 
the core bits are gnarly and hard to get into, and would be even with copious 
documentation. I’m specifically thinking of couch_btree and couch_key_tree.

Still, I totally get that approaching the CouchDB source is a daunting task, 
and I can say from personal experience that it’s a multi-year effort to get 
proficient if you’re not doing this full-time (maybe a couple of months if you 
are).

As for where the split happens (I’ll leave technical details to the 
appropriate threads): the plan is to replace *both* the underlying storage 
technology *and* the distributed systems layer. So how bits land on disk *and* 
how bits are sent over the network.

Aside: CouchDB 2.2.0 shipped with what we call pluggable storage engines, where 
one can already switch out the part that decides how bits get written to disk. 
It’s just that there are no production-ready alternative implementations just 
yet :)
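
To make the pluggable-engine idea concrete, here is a minimal Python sketch. 
It is not CouchDB's actual engine API (CouchDB is written in Erlang); the names 
StorageEngine, InMemoryEngine and Database are hypothetical and only illustrate 
the shape of the idea: the upper layers call a small interface, and only the 
engine behind it knows where the bits end up.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class StorageEngine(ABC):
    """Hypothetical interface: the only layer that knows how bits are stored."""

    @abstractmethod
    def write_doc(self, doc_id: str, body: dict) -> None:
        ...

    @abstractmethod
    def read_doc(self, doc_id: str) -> Optional[dict]:
        ...


class InMemoryEngine(StorageEngine):
    """Toy engine that keeps documents in a dict instead of a file on disk."""

    def __init__(self) -> None:
        self._docs: Dict[str, dict] = {}

    def write_doc(self, doc_id: str, body: dict) -> None:
        # Store a copy so later changes by the caller do not leak in.
        self._docs[doc_id] = dict(body)

    def read_doc(self, doc_id: str) -> Optional[dict]:
        doc = self._docs.get(doc_id)
        return dict(doc) if doc is not None else None


class Database:
    """Upper layer: talks to whichever engine it was given, nothing else."""

    def __init__(self, engine: StorageEngine) -> None:
        self._engine = engine

    def put(self, doc_id: str, body: dict) -> None:
        self._engine.write_doc(doc_id, body)

    def get(self, doc_id: str) -> Optional[dict]:
        return self._engine.read_doc(doc_id)
```

Swapping the storage technology then means passing a different StorageEngine 
subclass to Database; the code above the interface stays untouched.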

As to how to describe all this: a cynical take would be to say we add 
replication-bits to FDB and call it CouchDB, but in detail, there is much more 
to it than just an API layer. We’ll port over how secondary indexing works 
today (including JS and Mango view servers), we’ll keep the replicator engine 
that uses the CouchDB API to synchronise two database instances, plus a bunch 
of ancillary stuff that is worth keeping around. All of these things are 
significant investments that capture the value of CouchDB.

It’s worth keeping in mind that FoundationDB is meant as a “build your own 
database on top” kind of project, and not a standalone database system. And 
building a custom database on top is what this proposal is about (vis-a-vis 
your PS), so “adding X to FDB” is not really a thing unless you extend the 
build-your-own-db toolkit.
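
As a rough illustration of what “building a database on top” of an ordered 
key-value store can look like, here is a small Python sketch. Everything in it 
is an assumption made for illustration: OrderedKV is an in-memory stand-in and 
not the FoundationDB client API, and the key layout (("db", name, "doc", id) 
and a "by_type" index) is invented, not the mapping that will be proposed on 
this list. In a transactional store, the writes inside put_doc would go into a 
single transaction so that document and index never disagree.

```python
import bisect
import json
from typing import Iterator, List, Optional, Tuple

Key = Tuple[str, ...]


class OrderedKV:
    """In-memory stand-in for an ordered key-value store (not the FDB API)."""

    def __init__(self) -> None:
        self._keys: List[Key] = []   # kept sorted at all times
        self._values: dict = {}

    def set(self, key: Key, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: Key) -> Optional[bytes]:
        return self._values.get(key)

    def clear(self, key: Key) -> None:
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def get_range(self, prefix: Key) -> Iterator[Tuple[Key, bytes]]:
        """Yield all pairs whose key starts with `prefix`, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            yield key, self._values[key]


def get_doc(kv: OrderedKV, db: str, doc_id: str) -> Optional[dict]:
    raw = kv.get(("db", db, "doc", doc_id))
    return json.loads(raw) if raw is not None else None


def put_doc(kv: OrderedKV, db: str, doc_id: str, body: dict) -> None:
    """Write the document and keep the hypothetical by_type index in step."""
    old = get_doc(kv, db, doc_id)
    if old is not None:
        # Drop the index row that pointed at the previous version.
        kv.clear(("db", db, "by_type", str(old.get("type")), doc_id))
    kv.set(("db", db, "doc", doc_id), json.dumps(body).encode())
    kv.set(("db", db, "by_type", str(body.get("type")), doc_id), b"")


def docs_by_type(kv: OrderedKV, db: str, doc_type: str) -> List[str]:
    """One contiguous range read answers the index query."""
    prefix = ("db", db, "by_type", doc_type)
    return [key[-1] for key, _ in kv.get_range(prefix)]
```

The point of the sketch is that the “database” part (documents, indexes, 
their consistency) lives entirely in the layer functions; the store underneath 
only offers ordered keys, point reads and range reads.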

As for having to learn FDB behaviour when wanting to run a hypothetical CouchDB 
version that is built on top of it: yes, absolutely. This is knowledge this 
project will have to acquire and put into community knowledge by way of 
documentation and discussions on these mailing lists and the issue tracker, 
just like we do today. There’s going to be a separate thread about this as 
well. In all fairness, 90% of CouchDB users can treat CouchDB as a black box 
and don’t have to dive below the HTTP API, or not a lot further, when, say, 
thinking about backups, but for the other 10%, they’ll eventually want to grasp 
the whole stack including FDB, and I’d personally consider it a part of this 
proposal to bring working knowledge of how to operate FDB into CouchDB communal 
knowledge.

So finally, as to whether this proposal will make it easier for folks to dive 
into the CouchDB source: my gut feeling is yes, based on two aspects:

1. The extended transparency of IBM feature design work within the Apache 
CouchDB community, e.g. we’ll get in on the ground level not only for new 
features, but for this whole endeavour, rather than being presented with the 
final source code that we then can try and run and understand ourselves. So 
extra transparency is baked into this discussion.

2. The FoundationDB documentation and community already cover more ground than 
the documentation for the CouchDB storage (bits on disk) and distribution (bits 
on the network) code ever has. You can already read up on FDB semantics, and the pending 
technical discussion here will outline in very close detail how it’s planned to 
map CouchDB semantics onto the FDB primitives.

I hope that covers most of it, at least from my PoV as neither an IBM employee 
nor a champion of this proposal.

Best
Jan
—

> On 24. Jan 2019, at 00:27, Michael Fair <[email protected]> wrote:
> 
> As someone who isn't as directly involved in the release-to-release
> development, would a move like this make it easier or harder for new/casual
> community members to get up to speed/understand what's going on?
> 
> As projects grow and mature, the introductory learning curve tends to get
> steeper, making it harder for people who didn't "grow up with the project"
> to grok the project as a whole thing.  Not complaining, just identifying.
> 
> Is this proposal suggesting something more akin to a storage layer
> separation (making it somewhat easier to identify the separate component
> layers and experiment with different backends) or more like a storage
> technology change (where any experimenter would first have to understand
> how FDB semantics are different from File I/O semantics)?
> 
> All in all it sounds like a promising proposal.
> My first thought was something like "Hmm, is this different than simply
> adding a 'Couch Replication Protocol' module to FoundationDB? Probably, or
> they wouldn't be proposing it this way"
> 
> Followed quickly by, "Okay, looks like I'll likely need to start learning
> FoundationDB now too if I really want to understand CouchDB's
> capabilities.  I've not really heard much/looked at it before..."
> 
> I don't think a new learning curve should dissuade people from adopting it,
> but as I haven't looked at the educational materials available, I can't
> speak to the level of "ownership" the general community would be able to
> keep.
> 
> My experience is, generally speaking, people simply avoid aspects of a
> project they don't feel competent in.  Leaving that work to those with
> stronger opinions/convictions/interest. And that the easier it is to
> independently "get up to speed" on that aspect of the project (reading a
> blog(s)/watching a video(s)/tracing code) the more likely an interested
> party is to contribute there.
> 
> It'd be great to find out that a consequence of this move makes it easier
> for interested people, still unfamiliar with CouchDB's internals, to get
> more involved because there were some great and easily accessible teaching
> materials...
> 
> This concept obviously isn't unique to this FDB proposal; nor is it
> advocating for or against; I guess it's just expressing a hope that the
> impact is made to also help those who would like to get started
> contributing to CouchDB in meaningful ways instead of them getting a new
> and more complicated third party tech dependency to go learn as well.
> 
> Mike
> 
> PS While I assume there's likely very clear answers, does this differ
> significantly from the idea of giving FoundationDB a Couch compatible web
> API interface?  Like instead of making FoundationDB "the storage backend"
> for Couch, why not add a Couch compatible web interface front end to
> FoundationDB?  Is there a lot of useful Couch code in between those two
> things?
> 
> 
> On Wed, Jan 23, 2019 at 12:20 PM Joan Touzet <[email protected]> wrote:
> 
>> Hi everyone,
>> 
>> As Jan mentions, the PMC has had a couple of weeks to prepare on this.
>> 
>> As a non-IBMer (though an ex-IBM-er and ex-Cloudant-er), I've had my
>> Apache PMC hat on the entire time, considering all of the things
>> that Jan mentions and more. My primary concern has been ensuring that,
>> should this go forward, what happens occurs in the project's best
>> interest.
>> 
>> During the analysis process I came up with 8 serious topics that we
>> need to sort out:
>> 
>> * RFC process - how major changes are proposed/designed/accepted,
>>                see new GitHub template for a preview on this
>> 
>> * Bylaws review - namely, should we insist on +1s from outside
>>                  your company for big things? Plus RFC/deprecations.
>> 
>> * Roadmap - we have a roadmap from ~24 months ago that represented
>>            our goals for CouchDB 2.x and 3.x. What happens to it?
>>            https://s.apache.org/couch2xroadmap
>> 
>> * Onboarding - better mentoring in The Apache Way and The CouchDB
>>               Way for new members (from IBM and elsewhere)
>> 
>> * (Re-)Branding - how do we differentiate between "CouchDB Classic"
>>                  and "New CouchDB" in a succinct and clear way?
>> 
>> * FoundationDB - all the non-technical aspects. Review of _their_
>>                 project governance, cross-project pollination, us
>>                 learning the core and pros/cons, identifying who
>>                 will actually learn that code base, and operational
>>                 considerations. Also: keeping this knowledge public
>>                 and not just "inside IBM's dev/ops teams".
>> 
>> * Proj. Mgmt. - Obviously IBM will have a PM involved. We should too.
>>                Reviewing process/procedure and ensuring a smooth
>>                collaboration is critical. IBM doesn't get to just
>>                throw code over the wall at us. Similarly, should we
>>                choose to work on proposed features, or stuff from
>>                the roadmap, we need to be able to cooperate. No
>>                cookie licking allowed![*]
>> 
>> * Tech deep dives - this will actually be many, many threads I expect,
>>                    including everyone's favourite on release mgmt :P
>> 
>> New threads will be started on these topics by PMC members over the
>> coming days (but not all at once, so everyone has time to reflect and
>> respond.)
>> 
>> My initial take on the proposal: it's GOOD that we're finally
>> addressing some of the problems that 2.x brought to the table, and if
>> this is the best way to do so, then so be it. I want to know more
>> about the technical details, and I want to see a more formal RFC before
>> voting on it, though.
>> 
>> -Joan 'And now for something completely different...' Touzet
>> 
>> [*] http://communitymgt.wikia.com/wiki/Cookie_Licking
>> 
>> 
>> ----- Original Message -----
>>> From: "Jan Lehnardt" <[email protected]>
>>> To: "CouchDB Developers" <[email protected]>
>>> Sent: Wednesday, January 23, 2019 8:33:30 AM
>>> Subject: Re: [DISCUSS] Rebase CouchDB on top of FoundationDB
>>> 
>>> Hi Bob,
>>> 
>>> this is all very exciting!
>>> 
>>> First up, full disclosure, the CouchDB PMC has had about two weeks to
>>> think about this already, so if any of the following doesn’t sound
>>> like a knee-jerk reaction, that’s why.
>>> 
>>> I’m personally tentatively optimistic about this proposal and I’m
>>> willing to work through all open questions from governance,
>>> contribution management to the technical bits to see if we as the
>>> CouchDB project arrive at a point where we are comfortable going
>>> down this path.
>>> 
>>> The PMC has already identified a set of discussion areas for this
>>> dev@ mailing list to go through before any definite decision can be
>>> made. Separate emails for those discussions are going to be posted
>>> on this list shortly, so I won’t go into further detail here.
>>> 
>>> If anyone sees a need for discussion beyond the threads that will
>>> appear here, please speak up at your earliest convenience. This
>>> proposal would mean a big step for our project, and we must make
>>> sure to hear all voices.
>>> 
>>> Once we’ve gone through all this, the resulting answers to all the
>>> open questions coming up will end up in a consensus finding process
>>> on this mailing list, which will signify the final project decision.
>>> 
>>> * * *
>>> 
>>> That said, I’d like to highlight one of these topics: IBM/Cloudant’s
>>> contributions going forward.
>>> 
>>> Looking at how 2.0 came to be, the contributions were mostly taken on
>>> good faith (and legal review), and from the trust Cloudant built up
>>> operating a large number of large instances of clusters of what
>>> would eventually become CouchDB 2.0. It has clearly paid off for
>>> CouchDB and our current level of success wouldn’t exist without
>>> IBM/Cloudant.
>>> 
>>> However, some of the ways we work with the IBM team leave things to
>>> be desired. Specifically, the Apache CouchDB community is frequently
>>> not involved in design discussions around new features. Those happen
>>> inside IBM and we “only” get a PR that then goes through the regular
>>> review process. Again, this has served us well, but we can do even
>>> better, so I’d like to take the opportunity of this larger proposal
>>> to suggest we actually do better. As promised, a more detailed
>>> thread about this is going to come up, and it’ll be the right place
>>> to go through the minutiae of this.
>>> 
>>> With this structural change, I believe we are in a great position to
>>> work through the details of this proposal and the subsequent design
>>> and engineering steps.
>>> 
>>> * * *
>>> 
>>> Finally, I want to reiterate Bob’s point: while this proposal is
>>> largely driven by IBM, IBM has no power to unilaterally force the
>>> CouchDB project to accept this proposal and they have already
>>> signalled and worked towards making this a mutually beneficial
>>> endeavour. The CouchDB project has different objectives from IBM and
>>> it is up to us to come up with a proposal that satisfies all of our
>>> objectives as well as IBM’s, should this motion pass.
>>> 
>>> Best
>>> Jan
>>> —
>>> 
>>> 
>>>> On 23. Jan 2019, at 11:00, Robert Samuel Newson
>>>> <[email protected]> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> CouchDB 2.0 introduced clustering; the ability to scale a single
>>>> database across multiple nodes, increasing both the maximum size
>>>> of a database and adding native fault-tolerance. This welcome and
>>>> considerable step forward was not without its trade-offs. In the
>>>> years since 2.0 was released, users frequently encounter the
>>>> following issues as a direct consequence of the 2.0 clustering
>>>> approach:
>>>> 
>>>> 1. Conflict revisions can be created on normal concurrent updates
>>>> issued to a single database, since each replica of a database
>>>> shard independently chooses whether to accept a given update, and
>>>> all replicas will eventually propagate updates that any one of
>>>> them has chosen to accept.
>>>> 2. Secondary indexes ("views") do not scale the same way as
>>>> document lookups, as they are sharded by doc id, not emitted view
>>>> key (thus forcing a consultation of all shard ranges for each
>>>> query).
>>>> 3. The changes feed is no longer totally ordered and, worse, could
>>>> replay earlier changes in the event of a node failure (even a
>>>> temporary one).
>>>> 
>>>> The idea is to use FoundationDB as the new CouchDB foundational
>>>> layer, letting it take care of data storage and placement. An
>>>> introduction to FoundationDB would take up too much space here so
>>>> I will summarise it as a highly scalable ordered key-value store
>>>> with transactional semantics that provides strong consistency and
>>>> scales from a single node to many. It is licensed under the ASLv2 but is
>>>> not an Apache project.
>>>> 
>>>> By using FoundationDB we can solve all three of the problems listed
>>>> above and deliver semantics much closer to CouchDB 1.x's behaviour
>>>> while improving upon the scalability advantages that 2.0
>>>> introduced. The essential character of CouchDB would be preserved
>>>> (MVCC for documents, replication between CouchDB databases) but
>>>> the underlying plumbing would change significantly. In addition,
>>>> this new foundation will allow us to add long wished-for features
>>>> more easily. For example, multi-document transactions become
>>>> possible, as does efficient field-level reading and writing. A
>>>> further thought is the ability to update views transactionally
>>>> with the database update.
>>>> 
>>>> For those familiar with the CouchDB 2.0 architecture, the proposal
>>>> is, in effect, to change all the functions in fabric.erl so that
>>>> they work against a (possibly remote) FoundationDB cluster instead
>>>> of the current implementation of calling into the original CouchDB
>>>> 1.x code (couch_btree, couch_file, etc).
>>>> 
>>>> This is a large change and, for full disclosure, the IBM Cloudant
>>>> team are proposing it. We have done our due diligence in
>>>> investigating FoundationDB as well as detailed investigation into
>>>> how CouchDB semantics would be built on top of FoundationDB. Any
>>>> and all decisions on that must take place here on the CouchDB
>>>> developer mailing list, of course, but we are confident that this
>>>> is feasible.
>>>> During those investigations we have identified a small number of
>>>> CouchDB features that we do not yet see a way to do on
>>>> FoundationDB, the main one being custom (Javascript) reduces. This
>>>> is a direct consequence of no longer rolling our own persistence
>>>> layer (couch_btree and friends) and would likely apply to any
>>>> alternative technology.
>>>> 
>>>> I think this would be a great advance for CouchDB, preserving what
>>>> makes CouchDB special but taking advantage of the superbly
>>>> engineered FoundationDB software at the bottom of the stack.
>>>> 
>>>> Regards,
>>>> Robert Newson
>>> 
>>> --
>>> Professional Support for Apache CouchDB:
>>> https://neighbourhood.ie/couchdb-support/
>>> 
>>> 
>> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
