Heya list,
this is a conversation that went off-list; it might
be of value here and could be continued here.



Begin forwarded message:

[...]
If you have the time to write, I do have a few general questions for you. I'll try to quickly describe our situation here and where I think CouchDB might fit, at least initially. We have lots of fun trying to scale everything around here with our fast growth. Our local dev teams here in San Antonio work on a ticketing system with millions of tickets using Postgres and Python mostly. We're also just deploying a Xapian search index infrastructure to try to offload some of the workload now weighing heavily on Postgres.

I'm specifically looking at CouchDB for a workflow system addition, where the workflow definitions themselves and the instances of those workflows in action will be stored.

I know of a research project here near Berlin
that deals with exactly this, a workflow system,
possibly built on top of CouchDB (they've been
researching for a while now and only recently
discovered CouchDB, but they had so many
issues getting things up on Postgres that
they immediately considered CouchDB). Anyway,
maybe I can put you guys in contact?

[Hagen, that is for you.]


The activity would somewhat mirror our ticketing system, though not in data size volume, certainly in data change volume. I can get better stats for you later, but I would guess 2-4 thousand new workflow instances a day and probably that many older workflow instances updated per day as well. Likely, there'd be about 50 thousand "recently updated (within a few weeks)" workflow instances and eventually a few million inactive instances for archival querying. I'm afraid I can't even make a good guess at the read volume right now. The scary part for us is that our volume has been doubling each year so far.

Scary indeed, but good to know ;)


I know CouchDB has been built for scalability, but what sorts of use cases have you guys tried out, especially with regard to large data sets? What other items do you think might affect the system I generally described above?

[...]

The core feature for scaling CouchDB is replication.
That is an asynchronous, N-directional, rsync-like
operation that brings N nodes to the same level
data-wise.

This allows you to build any combination of
master-slave(-slave-slave…), master-master
and N-master replication topologies. The latter
is effectively distributed peer-to-peer replication
that even works when nodes are only occasionally
online.

CouchDB uses an HTTP API, which allows you to put
all the usual suspects (proxies, load balancers etc.)
in front of it.
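
To make that concrete, here's a minimal Python 3
sketch of talking to that HTTP API directly. The
host, port and the "workflows" database name are
just made up for the example, and it assumes the
database already exists (a plain PUT /workflows
creates it):

    import json
    import urllib.request

    BASE = "http://localhost:5984"  # CouchDB's default port

    def put_doc(db, doc_id, doc):
        """Create or update a document with an HTTP PUT."""
        req = urllib.request.Request(
            f"{BASE}/{db}/{doc_id}",
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def get_doc(db, doc_id):
        """Fetch a document with an HTTP GET."""
        with urllib.request.urlopen(f"{BASE}/{db}/{doc_id}") as resp:
            return json.load(resp)

    print(put_doc("workflows", "wf-0001", {"type": "workflow", "state": "open"}))
    print(get_doc("workflows", "wf-0001"))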

Since multi-master replication only helps you scale
writes and master-slave replication only helps you
scale reads, there's nothing yet that helps with
large data sets. Except for large drives in each
node :)

What we (Damien) have in mind for post-1.0 is
built-in database partitioning/sharding that would
split up a DB onto multiple nodes and handle
requests automagically for you. For now, you would
have to build such a system with an intermediate
HTTP proxy service that decides which
documents go where and vice versa (that is,
it would have to do the hashing and distribution).
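
Here's a rough Python sketch of the kind of
hashing/distribution such a proxy would do. The
node URLs are placeholders, and a real proxy would
also have to forward the actual HTTP requests and
deal with views, which this toy router doesn't:

    import hashlib

    NODES = [
        "http://couch1.example.com:5984/workflows",
        "http://couch2.example.com:5984/workflows",
        "http://couch3.example.com:5984/workflows",
    ]

    def node_for(doc_id):
        """Map a document id to one node by hashing the id."""
        digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    print(node_for("wf-0001"))  # the same id always routes to the same node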

That said, a few million workflows, both live and
archived, should not be a problem for a single
node as far as I can see. You might want to
slave-out reads if more are coming in than
your disk I/O can handle. CouchDB should not
be the bottleneck here. Again, a reverse proxy
in front of CouchDB should help a great deal.
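
Just to illustrate the idea, read/write splitting
in application code could look like the sketch
below. The URLs are placeholders; in practice a
reverse proxy like nginx or HAProxy would do this
for you:

    import itertools

    MASTER = "http://master.example.com:5984/workflows"
    SLAVES = itertools.cycle([
        "http://slave1.example.com:5984/workflows",
        "http://slave2.example.com:5984/workflows",
    ])

    def base_url(is_write):
        """Send writes to the master, round-robin reads over the slaves."""
        return MASTER if is_write else next(SLAVES)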

CouchDB is designed to handle a lot (as in
A LOT) of concurrent requests instead of
optimising for single-query speed. Single queries
are still reasonably fast, but an RDBMS is surely
faster. We can, however, easily handle 10x or more
concurrent requests compared to the average RDBMS
(for reads, that is).

A little number bragging: the Mac Mini I use
as a workstation does at most around 100 random
reads per second with MySQL. With CouchDB I can
get to ~1,000.

That is with an unoptimised CouchDB. CouchDB
does not yet perform any sort of caching and
we haven't done any profiling, so things will speed
up significantly before 1.0.


How do you envision data distributed over the world (both reading and updating)?

Have at least a single master server (and a backup)
in each location. This makes each location independent
of all the others in case the Internet goes away.

Additionally, you can have these master servers
replicate with each other. That way, your Japan office
can operate on the US office's data without round-
tripping to the US.
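
Kicking off such a replication is just another HTTP
call. Here's a hedged Python sketch using the
_replicate endpoint; the hostnames and database
name are made up, and the exact endpoint may differ
depending on your version:

    import json
    import urllib.request

    def replicate(source, target):
        """Ask the local CouchDB to replicate changes from source to target."""
        req = urllib.request.Request(
            "http://localhost:5984/_replicate",
            data=json.dumps({"source": source, "target": target}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Push local changes to the US office, then pull theirs back.
    replicate("workflows", "http://us-office.example.com:5984/workflows")
    replicate("http://us-office.example.com:5984/workflows", "workflows")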


If replication, how fast would that replication be (we have fall over support structures where people in different countries can help teams in other countries)?

Replication works on the document level at
the moment (attribute-level replication may come
in the future): only the documents that were added,
changed or deleted get replicated. That is, only
diffs are exchanged, and eventually all participating
DBs end up with the same data. How fast that is
depends on the amount of data, the number of changes
and your connection speed. At the moment operations
are not batched; single documents are fetched and
applied one by one, which slows us down a bit. We
have plans to further optimise replication by fixing
this and other things.


What sort of support structure do you envision for CouchDB?

What do you mean here? Support for the
'product' CouchDB? This is open source, so
you are on your own :) We have a lively, small,
yet growing community that helps each
other out on IRC and our mailing lists. Apart from
that, I'm a freelancer and available for money
:) There might be others, too.


You mentioned hitting beta this summer, is that still on track? Do you expect to hit 1.0 by the end of the year, or something recommended for production systems? I know these time lines are impossible to really know, but I have to ask. =)

We are still on track with this, yeah, but I won't
promise anything :) One thing to point out, though,
is that we carry the alpha label because we don't
yet have all the features we'd like to see in 1.0.
Those that are in are relatively stable, and we haven't
gotten many serious problem reports.


Will authentication and permission schemes be part of 1.0?

We will have validation, that is, a mechanism that can
deny a write on terms you define. That is not exactly
authentication, and any permission system would
have to be built on top of it. Authentication itself
will come post-1.0; until then we suggest using an
HTTP proxy to solve that for you.

[...]

Cheers
Jan
--
