CouchDB Cluster/Partition GSoC

Randall Leeds Sun, 29 Mar 2009 18:00:05 -0700

To start, I'd like to introduce myself: I've been contributing on and off
in small ways to the dev list and a little IRC chatter, but I haven't been
very visible in the community.

My name is Randall and I'm a student at Brown University in Providence,
Rhode Island, USA. I've got one more semester ahead of me in my
undergraduate degree. I've been working with CouchDB on the Melkjug[1]
project since June and have been intermittently active with
couchdb-python[2] as a committer fixing small bugs.

I'd like to create and polish a proposal this week for submission as a
Google Summer of Code Project.

To that end, this thread is meant to start the drafting process and to
determine a prioritized list of tasks, and the dependencies between them,
required to support a smooth clustering and partitioning experience in
CouchDB.

Skip the next section if you just want to read my questions and jump right
into the discussion.

Otherwise, here's a brief overview of background information:

A clarification of terms:

On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz <[email protected]> wrote:
> I see partitioning and clustering as 2 different things. Partitioning is
> data partitioning, spreading the data out across nodes, no node having the
> complete database. Clustering is nodes having the same, or nearly the same
> data (they might be behind on replicating changes, but otherwise they have
> the same data).
source:
http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%[email protected]%3e

Existing partitioning proposal[3] on the wiki:
http://wiki.apache.org/couchdb/Partitioning_proposal

From an e-mail exchange between Chris A and me on first steps:

Chris wrote:
>I think as far as writing goes, there's still more work to be done on
>design, but there are some pieces that can be written first:
>
> * consistent hashing Erlang proxy (start out with HTTP)
> * view merging across partitioned nodes
>
>These two can be run as their own software at first, so they can sit
>in front of a cluster of CouchDB machines without any changes
>happening to CouchDB. Once they work, they can be tied to CouchDB
>using Erlang terms and IPC instead of JSON/HTTP.
>
>There are some design questions about a partitioned CouchDB that we
>should probably take up on the list:
>
> * what about _all_docs and other node-global queries?
> * does a cluster use a single seq-num or does each node have its own?
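
To make the view-merging item above a bit more concrete, here is a minimal
Python sketch of what a standalone proxy might do with the rows it gets back
from each partition. It is not CouchDB code: the function name and the sample
rows are hypothetical, and it assumes each node returns rows already sorted
by key and that plain Python string ordering is good enough (real CouchDB
view collation has its own rules).

```python
import heapq
from operator import itemgetter


def merge_view_rows(*partition_rows):
    """Merge per-partition view results into one key-ordered list.

    Assumes every input list is already sorted by its "key" field.
    """
    return list(heapq.merge(*partition_rows, key=itemgetter("key")))


# Hypothetical rows from two partition nodes for the same view.
node_a = [
    {"id": "doc1", "key": "apple", "value": 1},
    {"id": "doc4", "key": "pear", "value": 1},
]
node_b = [
    {"id": "doc2", "key": "banana", "value": 1},
    {"id": "doc3", "key": "cherry", "value": 1},
]

merged = merge_view_rows(node_a, node_b)
```

Because the merge only needs the head of each sorted stream, a proxy could in
principle stream rows to the client without buffering whole result sets.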

Finally, the great folks from Meebo have recently posted couchdb-lounge[4]
which uses an nginx proxy to make a "partitioning/clustering framework for
CouchDB".

The questions we should address are prioritized here:
1) What's required to make CouchDB a full OTP application? Isn't it using
gen_server already?
2) What about _all_docs and seq-num?
3) Can we agree on a proposed solution to the layout of partition nodes? I
like the tree solution, as long as it is extremely flexible wrt tree depth.
4) Should the consistent hashing algorithm map ids to leaf nodes or just to
children? I lean toward children because it encapsulates knowledge about the
layout of subtrees at each tree level.
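
As a rough illustration of question 4, here is a minimal Python sketch of the
"map ids to children" option. Nothing in it comes from CouchDB or the wiki
proposal: the Ring class, the node names, the use of MD5, and the number of
virtual points per child are all assumptions made for the example. Each
interior node holds a consistent-hashing ring over its own children only, so
routing a document id walks the tree one level at a time.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Consistent-hashing ring over the direct children of one node."""

    def __init__(self, children, replicas=64):
        # Several virtual points per child smooth out the distribution.
        self._points = sorted(
            (_hash("%s#%d" % (name, i)), name)
            for name in children
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def child_for(self, doc_id):
        # First ring point clockwise from the id's hash, wrapping around.
        idx = bisect.bisect(self._keys, _hash(doc_id)) % len(self._keys)
        return self._points[idx][1]


# Hypothetical two-level tree; each node knows only its own children.
tree = {
    "root": Ring(["east", "west"]),
    "east": Ring(["east-1", "east-2"]),
    "west": Ring(["west-1", "west-2"]),
}


def route(doc_id, node="root"):
    """Return the path from the root to the leaf responsible for doc_id."""
    path = [node]
    while node in tree:
        node = tree[node].child_for(doc_id)
        path.append(node)
    return path


path = route("doc-42")
```

In this layout, changing the children of one subtree only requires rebuilding
that subtree's ring; the root never needs to know about the leaves. Mapping
ids straight to leaf nodes would instead need a single ring listing every
leaf.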

Submissions for GSoC are due by Friday so I would appreciate any help in
polishing a proposal that will best serve the needs of the CouchDB
community. Hopefully this generates some initial discussion that will lead
me to a draft proposal in the next couple of days which I will post for
revision and comment until I submit it at the end of the week.

Thanks in advance,
Randall

[1] http://www.openplans.org/projects/melkjug/project-home
[2] http://code.google.com/p/couchdb-python/
[3] http://wiki.apache.org/couchdb/Partitioning_proposal
[4] http://code.google.com/p/couchdb-lounge/
