To start, I'd like to introduce myself. I've contributed off and on in small ways to the dev list and a little IRC chatter, but I haven't been very visible in the community.
My name is Randall and I'm a student at Brown University in Providence, Rhode Island, USA. I have one more semester ahead of me in my undergraduate degree. I've been working with CouchDB on the Melkjug[1] project since June and have been intermittently active with couchdb-python[2] as a committer fixing small bugs.

I'd like to create and polish a proposal this week for submission as a Google Summer of Code project. To that end, this thread is meant to start the drafting process and determine a prioritized list of tasks, and the dependencies between them, required to support a smooth clustering and partitioning experience in CouchDB.

Skip the next section if you just want to read my questions and jump right into the discussion. Otherwise, here's a brief overview of background information.

A clarification of terms:

On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz <[email protected]> wrote:
> I see partitioning and clustering as 2 different things. Partitioning is
> data partitioning, spreading the data out across nodes, no node having the
> complete database. Clustering is nodes having the same, or nearly the same
> data (they might be behind on replicating changes, but otherwise they have
> the same data).

source: http://mail-archives.apache.org/mod_mbox/couchdb-dev/200902.mbox/%[email protected]%3e

Existing partitioning proposal[3] on the wiki:
http://wiki.apache.org/couchdb/Partitioning_proposal

From an e-mail between myself and Chris A on first steps:

Chris wrote:
>I think as far as writing goes, there's still more work to be done on
>design, but there are some pieces that can be written first:
>
> * consistent hashing Erlang proxy (start out with HTTP)
> * view merging across partitioned nodes
>
>These two can be run as their own software at first, so they can sit
>in front of a cluster of CouchDB machines without any changes
>happening to CouchDB. Once they work, they can be tied to CouchDB
>using Erlang terms and IPC instead of JSON/HTTP.
>
>There are some design questions about a partitioned CouchDB that we
>should probably take up on the list:
>
> * what about _all_docs and other node-global queries?
> * does a cluster use a single seq-num or does each node have it's own?

Finally, the great folks from Meebo have recently posted couchdb-lounge[4], which uses an nginx proxy to make a "partitioning/clustering framework for CouchDB".

The questions we should address, in priority order:

1) What's required to make CouchDB a full OTP application? Isn't it using gen_server already?
2) What about _all_docs and seq-num?
3) Can we agree on a proposed solution to the layout of partition nodes? I like the tree solution, as long as it is extremely flexible wrt tree depth.
4) Should the consistent hashing algorithm map ids to leaf nodes or just to children? I lean toward children because that keeps knowledge about the layout of each subtree encapsulated at its own tree level.

Submissions for GSoC are due by Friday, so I would appreciate any help in polishing a proposal that will best serve the needs of the CouchDB community. Hopefully this generates some initial discussion that leads to a draft proposal in the next couple of days, which I will post for revision and comment until I submit it at the end of the week.

Thanks in advance,
Randall

[1] http://www.openplans.org/projects/melkjug/project-home
[2] http://code.google.com/p/couchdb-python/
[3] http://wiki.apache.org/couchdb/Partitioning_proposal
[4] http://code.google.com/p/couchdb-lounge/
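
P.S. To make question 4 a bit more concrete, here is a rough sketch of what "map ids to children only" could look like. It is written in Python rather than Erlang purely for readability, and it is not taken from the wiki proposal or from couchdb-lounge. Everything in it is an assumption made for illustration: the Ring and Node classes, the node names, MD5 as the hash function, and 64 ring points per child. Each node hashes a document id only onto its own direct children, so no node needs to know how a subtree below it is laid out, and the tree can have uneven depth.

```python
import bisect
import hashlib


def ring_hash(key):
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Consistent-hash ring over the direct children of one tree node."""

    def __init__(self, child_names, points_per_child=64):
        # Several points per child smooth out the key distribution.
        self._points = sorted(
            (ring_hash("%s#%d" % (name, i)), name)
            for name in child_names
            for i in range(points_per_child)
        )
        self._keys = [point for point, _ in self._points]

    def child_for(self, doc_id):
        # First ring point clockwise from the id's hash, wrapping at the end.
        idx = bisect.bisect_right(self._keys, ring_hash(doc_id)) % len(self._keys)
        return self._points[idx][1]


class Node:
    """Hypothetical tree node: a proxy if it has children, otherwise a leaf."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = {child.name: child for child in children}
        # A node only ever builds a ring over its own direct children.
        self.ring = Ring(list(self.children)) if self.children else None

    def route(self, doc_id):
        """Return the list of node names visited from this node down to a leaf."""
        if self.ring is None:
            return [self.name]
        child = self.children[self.ring.child_for(doc_id)]
        return [self.name] + child.route(doc_id)


# Uneven depth on purpose: "west" is a leaf, "east" is itself a proxy.
tree = Node("root", [
    Node("east", [Node("east-1"), Node("east-2")]),
    Node("west"),
])
print(tree.route("some-doc-id"))
```

With this shape, adding a partition under "east" only changes the ring held by "east"; the root's ring is untouched. The alternative in question 4, hashing straight to leaf nodes, would instead require the root to hold a ring over every leaf in the tree.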
