Re: [DISCUSS] Rebase CouchDB on top of FoundationDB

Michael Fair Thu, 24 Jan 2019 03:38:25 -0800

Wow, I think I get it, mostly; I still haven't read the FDB docs yet, but I
grok the replies to Will's and my email, and it sounds like FoundationDB
has done some really good underlying work where CouchDB could, in a sense,
become an advocate project for FDB's utility.


TL;DR: +1 from me. :-)
Is this actually worthy of a 3.0 moniker? It seems like it could be
(breaking changes and dropping 1.X compatibility).


Some general, higher level, thoughts (that probably mimic what you guys
have already been thinking):
1) I believe CouchDB, the software project, needs a growth path for adding
features somewhat organically.  I haven't felt this has been the case
historically when it comes to the wire and data storage aspects of the
project.  There have been a few ideas I've wanted to experiment with in the
replication protocol.

For example:
* auto-resolving/merging conflict branches when we can tell/detect that the
heads of two conflicting branches contain an identical document;
* adding some version of a binary JSON encoding to reduce network
utilization;
* creating some kind of a JSON diff ability to transmit changes between
document revisions (that is also binary-able);
* experimenting with encrypted and private data stores to create a
decentralized vault where secret data could be shared with other parties
via CouchDB without revealing the secret data while in transit;
* "object type" based HTML template forms that could be directly "filled
out" by documents / "modified" by browsers
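
To make the first and third ideas a bit more concrete, here's a rough sketch
in Python. It is purely illustrative and not CouchDB code: the function names
and the set of ignored fields are my own assumptions. It compares two
conflicting heads by hashing a canonical JSON encoding of their bodies, and
computes a naive top-level diff between two revisions.

```python
import hashlib
import json

# Assumption: these revision-bookkeeping fields are ignored when comparing bodies.
IGNORED_FIELDS = {"_rev", "_revisions", "_conflicts"}


def content_digest(doc):
    """Hash a canonical JSON encoding of the document body."""
    body = {k: v for k, v in doc.items() if k not in IGNORED_FIELDS}
    # sort_keys gives a stable encoding regardless of field order.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def heads_are_identical(head_a, head_b):
    """True when two conflicting heads carry the same body."""
    return content_digest(head_a) == content_digest(head_b)


def shallow_diff(old, new):
    """Top-level diff: fields to set and fields to remove to turn old into new."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = sorted(k for k in old if k not in new)
    return {"set": changed, "unset": removed}


def apply_diff(old, diff):
    """Rebuild the new revision from the old one plus a shallow diff."""
    result = {k: v for k, v in old.items() if k not in diff["unset"]}
    result.update(diff["set"])
    return result
```

A real version would need to think about attachments, nested changes, and
which branch's revision id wins; this only shows the shape of the idea.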

Overall it sounds to me like the FoundationDB changes/advantages Jan and
Robert described so far collectively point in a direction that would,
generally speaking, make my life easier at approaching at least some of
these ideas.


2) I like the idea of "removing code" from Couch where it makes sense like
this.  The Apache CouchDB project to me has always, in a sense, been more
representative of the replication wire protocol and replication semantics
than the Erlang software project.  I really enjoy the idea that other
projects can incorporate their own "Couch Compatible Replication Layer" and
use the CouchDB software project as their de facto test software.
"Outsourcing" the KV Store work to a project like FDB where there are other
people who enjoy focusing on that specific aspect of the problem leaves the
Couch folks free to focus on higher level features; which I believe is a
wonderful thing and exactly the right direction the whole project should
generally be going.  I think of Couch as a very "end user facing" project,
and as such, I think it benefits more than it risks (and I understand the
risks of creating outside-party dependencies) by building this kind of
outsourced dependency on a project that demonstrates: (1) technical
competence, (2) a fair bit of maturity, (3) decent docs to help ourselves
navigate "their world", and (4) a willingness to be responsive to CouchDB
project requests/feedback/contributions.  It sounds like the FDB project
scores well on all four of those points.  Having community members with
feet/experience in both projects is a huge help/bonus.


3) This also seems to enable some ideas that I've long wanted to try but
never thought I could do because of CouchDB's document storage design:
indexing/exploring the db using the Neo4j graph database, and syncing docs
using the IPLD semantics within the IPFS project (very personal to me I know,
but still, it's nice to see the ideas look more promising to me).  I'm
personally a really big fan of the "multiple master copies at distributed
locations" aspect to CouchDB; more so than the "single
distributed/sharded/parallel multiserver database" aspect.  I understand
there are more immediate, lucrative applications to the local multiserver,
larger database aspect; so I'm excited to see that work done too. It's
simply not the aspect of the project that really catches my interest.



Thanks very much Jan/Robert for hearing what I had to say and giving great
and meaningful replies!

Jan, I really appreciate you commenting that you understood my concerns
about taking seriously the need to really incorporate the technology into
the DNA of the community and that's what you expect to see successfully
happen; and Robert, likewise, for adding that the existing FDB community
could very well be interested in Couch as a "front end test project" to
give practical application and meaning to some of their work.

While I'm not going to go so far as to say I can personally vouch for the
proposal's success, I do trust that it's true when Jan says "it looks like
we can pull this off reasonably easy enough by taking on a few growing
pains"; coupled with the clear amount of behind-the-scenes forethought
that went into it, I really like it.


Thanks!
Mike


On Thu, Jan 24, 2019 at 2:20 AM Robert Samuel Newson <rnew...@apache.org>
wrote:

> Hi,
>
> Thank you for the in-depth response, that’s exactly what the PMC is
> looking for.
>
> You are comprehending the nature and magnitude of the change correctly
> here, where you suggest we could “just” write a new CouchDB Layer on top of
> FoundationDB and achieve a similar effect. However, the nature of software
> and software development really speaks against doing it that way, in my
> opinion. In 2.0 we introduced an abstraction between the HTTP processing
> layer and the lower plumbing of b-trees and file I/O with the “fabric”
> application. This was essential to introduce clustering but it was a
> significant architectural improvement in its own right. By reimplementing
> below that line we can be more confident that we have preserved all the
> necessary parts of the CouchDB API and experience. Additionally, separate
> applications like the replicator and job scheduler can remain as they are.
> A lot of the existing code will remain as-is, or have minor changes or
> cleanup (the “local” mode for replication, unreachable since 2.0, can
> finally be excised, for example).
>
> To your other point, I remember the difficulty I first had when looking at
> CouchDB. It’s in Erlang, which I’d not used before, and there is a lot of
> subtle and tricky code at the lower tiers (see couch_key_tree.erl or
> couch_btree.erl). By using FoundationDB for that instead I hope we
> _increase_ the comprehensibility of CouchDB, as what remains will be its
> essential nature and not the important but ancillary plumbing below. The
> increased public development activity on CouchDB, the size of the ambition
> here, and some cross-pollinating interest from those who know or are
> interested in FoundationDB should, I hope, bring more active developers of
> all levels of experience and interest to our project.
>
> B.
>
> > On 23 Jan 2019, at 23:27, Michael Fair <mich...@daclubhouse.net> wrote:
> >
> > As someone who isn't as directly involved in the release-to-release
> > development, would a move like this make it easier or harder for
> > new/casual community members to get up to speed/understand what's
> > going on?
> >
> > As projects grow and mature, the introductory learning curve tends to get
> > steeper, making it harder for people who didn't "grow up with the
> > project" to grok the project as a whole thing.  Not complaining, just
> > identifying.
> >
> > Is this proposal suggesting something more akin to a storage layer
> > separation (making it somewhat easier to identify the separate component
> > layers and experiment with different backends) or more like a storage
> > technology change (where any experimenter would first have to understand
> > how FDB semantics are different from File I/O semantics)?
> >
> > All in all it sounds like a promising proposal.
> > My first thought was something like "Hmm, is this different than simply
> > adding a 'Couch Replication Protocol' module to FoundationDB? Probably,
> > or they wouldn't be proposing it this way"
> >
> > Followed quickly by, "Okay, looks like I'll likely need to start learning
> > FoundationDB now too if I really want to understand CouchDB's
> > capabilities.  I've not really heard much/looked at it before..."
> >
> > I don't think a new learning curve should dissuade people from adopting
> > it, but as I haven't looked at the educational materials available, I
> > can't speak to the level of "ownership" the general community would be
> > able to keep.
> >
> > My experience is, generally speaking, people simply avoid aspects of a
> > project they don't feel competent in.  Leaving that work to those with
> > stronger opinions/convictions/interest. And that the easier it is to
> > independently "get up to speed" on that aspect of the project (reading a
> > blog(s)/watching a video(s)/tracing code) the more likely an interested
> > party is to contribute there.
> >
> > It'd be great to find out that a consequence of this move makes it easier
> > for interested people, still unfamiliar with CouchDB's internals, to get
> > more involved because there were some great and easily accessible
> > teaching materials...
> >
> > This concept obviously isn't unique to this FDB proposal; nor is it
> > advocating for or against; I guess it's just expressing a hope that the
> > impact is made to also help those who would like to get started
> > contributing to CouchDB in meaningful ways instead of them getting a new
> > and more complicated third party tech dependency to go learn as well.
> >
> > Mike
> >
> > PS While I assume there's likely very clear answers, does this differ
> > significantly from the idea of giving FoundationDB a Couch compatible web
> > API interface?  Like instead of making FoundationDB "the storage backend"
> > for Couch, why not add a Couch compatible web interface front end to
> > FoundationDB?  Is there a lot of useful Couch code in between those two
> > things?
> >
> >
> > On Wed, Jan 23, 2019 at 12:20 PM Joan Touzet <woh...@apache.org> wrote:
> >
> >> Hi everyone,
> >>
> >> As Jan mentions, the PMC has had a couple of weeks to prepare on this.
> >>
> >> As a non-IBMer (though an ex-IBM-er and ex-Cloudant-er), I've had my
> >> Apache PMC hat on the entire time, considering all of the things
> >> that Jan mentions and more. My primary concern has been ensuring that,
> >> should this go forward, what happens occurs in the project's best
> >> interest.
> >>
> >> During the analysis process I came up with 8 serious topics that we
> >> need to sort out:
> >>
> >> * RFC process - how major changes are proposed/designed/accepted,
> >>                see new GitHub template for a preview on this
> >>
> >> * Bylaws review - namely, should we insist on +1s from outside
> >>                  your company for big things? Plus RFC/deprecations.
> >>
> >> * Roadmap - we have a roadmap from ~24 months ago that represented
> >>            our goals for CouchDB 2.x and 3.x. What happens to it?
> >>            https://s.apache.org/couch2xroadmap
> >>
> >> * Onboarding - better mentoring in The Apache Way and The CouchDB
> >>               Way for new members (from IBM and elsewhere)
> >>
> >> * (Re-)Branding - how do we differentiate between "CouchDB Classic"
> >>                  and "New CouchDB" in a succinct and clear way?
> >>
> >> * FoundationDB - all the non-technical aspects. Review of _their_
> >>                 project governance, cross-project pollination, us
> >>                 learning the core and pros/cons, identifying who
> >>                 will actually learn that code base, and operational
> >>                 considerations. Also: keeping this knowledge public
> >>                 and not just "inside IBM's dev/ops teams".
> >>
> >> * Proj. Mgmt. - Obviously IBM will have a PM involved. We should too.
> >>                Reviewing process/procedure and ensuring a smooth
> >>                collaboration is critical. IBM doesn't get to just
> >>                throw code over the wall at us. Similarly, should we
> >>                choose to work on proposed features, or stuff from
> >>                the roadmap, we need to be able to cooperate. No
> >>                cookie licking allowed![*]
> >>
> >> * Tech deep dives - this will actually be many, many threads I expect,
> >>                    including everyone's favourite on release mgmt :P
> >>
> >> New threads will be started on these topics by PMC members over the
> >> coming days (but not all at once, so everyone has time to reflect and
> >> respond.)
> >>
> >> My initial take on the proposal: it's GOOD that we're finally
> >> addressing some of the problems that 2.x brought to the table, and if
> >> this is the best way to do so, then so be it. I want to know more
> >> about the technical details, and I want to see a more formal RFC before
> >> voting on it, though.
> >>
> >> -Joan 'And now for something completely different...' Touzet
> >>
> >> [*] http://communitymgt.wikia.com/wiki/Cookie_Licking
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Jan Lehnardt" <j...@apache.org>
> >>> To: "CouchDB Developers" <dev@couchdb.apache.org>
> >>> Sent: Wednesday, January 23, 2019 8:33:30 AM
> >>> Subject: Re: [DISCUSS] Rebase CouchDB on top of FoundationDB
> >>>
> >>> Hi Bob,
> >>>
> >>> this is all very exciting!
> >>>
> >>> First up, full disclosure, the CouchDB PMC has had about two weeks to
> >>> think about this already, so if any of the following doesn’t sound
> >>> like a knee-jerk reaction, that’s why.
> >>>
> >>> I’m personally tentatively optimistic about this proposal and I’m
> >>> willing to work through all open questions from governance,
> >>> contribution management to the technical bits to see if we as the
> >>> CouchDB project arrive at a point where we are comfortable going
> >>> down this path.
> >>>
> >>> The PMC has already identified a set of discussion areas for this
> >>> dev@ mailing list to go through before any definite decision can be
> >>> made. Separate emails for those discussions are going to be posted
> >>> on this list shortly, so I won’t go into further detail here.
> >>>
> >>> If anyone sees a need for discussion beyond the threads that will
> >>> appear here, please speak up at your earliest convenience. This
> >>> proposal would mean a big step for our project, and we must make
> >>> sure to hear all voices.
> >>>
> >>> Once we’ve gone through all this, the resulting answers to all the
> >>> open questions coming up will end up in a consensus finding process
> >>> on this mailing list, which will signify the final project decision.
> >>>
> >>> * * *
> >>>
> >>> That said, I’d like to highlight one of these topics: IBM/Cloudant’s
> >>> contributions going forward.
> >>>
> >>> Looking at how 2.0 came to be, the contributions were mostly taken on
> >>> good faith (and legal review), and from the trust Cloudant built up
> >>> operating a large number of large instances of clusters of what
> >>> would eventually become CouchDB 2.0. It has clearly paid off for
> >>> CouchDB and our current level of success wouldn’t be without
> >>> IBM/Cloudant.
> >>>
> >>> However, some of the ways we work with the IBM team leave things to
> >>> be desired. Specifically, the Apache CouchDB community is frequently
> >>> not involved in design discussions around new features. Those happen
> >>> inside IBM and we “only” get a PR that then goes through the regular
> >>> review process. Again, this has served us well, but we can do even
> >>> better, so I’d like to take the opportunity of this larger proposal
> >>> to suggest we actually do better. As promised, a more detailed
> >>> thread about this is going to come up, and it’ll be the right place
> >>> to go through the minutiae of this.
> >>>
> >>> With this structural change, I believe we are in a great position to
> >>> work through the details of this proposal and the subsequent design
> >>> and engineering steps.
> >>>
> >>> * * *
> >>>
> >>> Finally, I want to reiterate Bob’s point: while this proposal is
> >>> largely driven by IBM, IBM has no power to unilaterally force the
> >>> CouchDB project to accept this proposal and they have already
> >>> signalled and worked towards making this a mutually beneficial
> >>> endeavour. The CouchDB project has different objectives from IBM and
> >>> it is up to us to come up with a proposal that satisfies all of our
> >>> objectives as well as IBM's, should this motion pass.
> >>>
> >>> Best
> >>> Jan
> >>> —
> >>>
> >>>
> >>>> On 23. Jan 2019, at 11:00, Robert Samuel Newson
> >>>> <rnew...@apache.org> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> CouchDB 2.0 introduced clustering; the ability to scale a single
> >>>> database across multiple nodes, increasing both the maximum size
> >>>> of a database and adding native fault-tolerance. This welcome and
> >>>> considerable step forward was not without its trade-offs. In the
> >>>> years since 2.0 was released, users frequently encounter the
> >>>> following issues as a direct consequence of the 2.0 clustering
> >>>> approach:
> >>>>
> >>>> 1. Conflict revisions can be created on normal concurrent updates
> >>>> issued to a single database, since each replica of a database
> >>>> shard independently chooses whether to accept a given update, and
> >>>> all replicas will eventually propagate updates that any one of
> >>>> them has chosen to accept.
> >>>> 2. Secondary indexes ("views") do not scale the same way as
> >>>> document lookups, as they are sharded by doc id, not emitted view
> >>>> key (thus forcing a consultation of all shard ranges for each
> >>>> query).
> >>>> 3. The changes feed is no longer totally ordered and, worse, could
> >>>> replay earlier changes in the event of a node failure (even a
> >>>> temporary one).
> >>>>
> >>>> The idea is to use FoundationDB as the new CouchDB foundational
> >>>> layer, letting it take care of data storage and placement. An
> >>>> introduction to FoundationDB would take up too much space here so
> >>>> I will summarise it as a highly scalable ordered key-value store
> >>>> with transactional semantics that provides strong consistency and
> >>>> scales from a single node to many. It is licensed under the ASLv2 but is
> >>>> not an Apache project.
> >>>>
> >>>> By using FoundationDB we can solve all three of the problems listed
> >>>> above and deliver semantics much closer to CouchDB 1.x's behaviour
> >>>> while improving upon the scalability advantages that 2.0
> >>>> introduced. The essential character of CouchDB would be preserved
> >>>> (MVCC for documents, replication between CouchDB databases) but
> >>>> the underlying plumbing would change significantly. In addition,
> >>>> this new foundation will allow us to add long wished-for features
> >>>> more easily. For example, multi-document transactions become
> >>>> possible, as does efficient field-level reading and writing. A
> >>>> further thought is the ability to update views transactionally
> >>>> with the database update.
> >>>>
> >>>> For those familiar with the CouchDB 2.0 architecture, the proposal
> >>>> is, in effect, to change all the functions in fabric.erl so that
> >>>> they work against a (possibly remote) FoundationDB cluster instead
> >>>> of the current implementation of calling into the original CouchDB
> >>>> 1.x code (couch_btree, couch_file, etc).
> >>>>
> >>>> This is a large change and, for full disclosure, the IBM Cloudant
> >>>> team are proposing it. We have done our due diligence in
> >>>> investigating FoundationDB as well as detailed investigation into
> >>>> how CouchDB semantics would be built on top of FoundationDB. Any
> >>>> and all decisions on that must take place here on the CouchDB
> >>>> developer mailing list, of course, but we are confident that this
> >>>> is feasible.
> >>>> During those investigations we have identified a small number of
> >>>> CouchDB features that we do not yet see a way to do on
> >>>> FoundationDB, the main one being custom (JavaScript) reduces. This
> >>>> is a direct consequence of no longer rolling our own persistence
> >>>> layer (couch_btree and friends) and would likely apply to any
> >>>> alternative technology.
> >>>>
> >>>> I think this would be a great advance for CouchDB, preserving what
> >>>> makes CouchDB special but taking advantage of the superbly
> >>>> engineered FoundationDB software at the bottom of the stack.
> >>>>
> >>>> Regards,
> >>>> Robert Newson
> >>>
> >>> --
> >>> Professional Support for Apache CouchDB:
> >>> https://neighbourhood.ie/couchdb-support/
> >>>
> >>>
> >>
>
>
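
PS For anyone skimming point 2 of Robert's original proposal quoted above
(views sharded by doc id rather than by emitted view key), here is how I
picture the difference, as a toy Python sketch. It is an in-memory stand-in
for an ordered key-value store and does not use the FoundationDB API; the
class, the function names, and the (emitted_key, doc_id) row layout are all
my own assumptions. The point is only that when view rows are kept in
emitted-key order, a key-range query becomes a single ordered scan instead of
a consultation of every shard.

```python
import bisect


class OrderedKV:
    """Toy in-memory ordered key-value store (not a real FoundationDB client)."""

    def __init__(self):
        self._keys = []     # keys kept in sorted order
        self._values = {}   # key -> value

    def set(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get_range(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def index_doc(store, doc_id, emitted):
    """Store one view row per emitted (key, value) pair.

    Hypothetical layout: the row key is the tuple (emitted_key, doc_id), so
    rows sort by emitted key first and by doc id second.
    """
    for key, value in emitted:
        store.set((key, doc_id), value)


def query_view(store, start_key, end_key):
    """Range query over emitted keys: one ordered scan, no per-shard fan-out."""
    # (start_key,) sorts before every (start_key, doc_id) row key.
    rows = store.get_range((start_key,), (end_key,))
    return [(key, doc_id, value) for (key, doc_id), value in rows]
```

For example, after indexing a few docs, query_view(store, "apple", "c") walks
only the contiguous run of rows whose emitted key falls in that range.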
