Heya list,
this is a conversation that went off-list; it might
be of value here and could be continued here.



Begin forwarded message:

[...]
If you have the time to write, I do have a few general questions for you. I'll try to quickly describe our situation here and where I think CouchDB might fit, at least initially. We have lots of fun trying to scale everything around here with our fast growth. Our local dev teams here in San Antonio work on a ticketing system with millions of tickets using Postgres and Python mostly. We're also just deploying a Xapian search index infrastructure to try to offload some of the workload now weighing heavily on Postgres.

I'm specifically looking at CouchDB for a workflow system addition, where the workflow definitions themselves and the instances of those workflows in action will be stored.

I know of a research project here near Berlin
that deals with exactly this, a workflow system,
possibly built on top of CouchDB (they've been
researching for a while now and only recently
discovered CouchDB, but they had so many
issues getting things up on Postgres that
they immediately considered CouchDB). Anyway,
maybe I can put you guys in contact?

[Hagen, that is for you.]


The activity would somewhat mirror our ticketing system, though not in data size volume, certainly in data change volume. I can get better stats for you later, but I would guess 2-4 thousand new workflow instances a day and probably that many older workflow instances updated per day as well. Likely, there'd be about 50 thousand "recently updated (within a few weeks)" workflow instances and eventually a few million inactive instances for archival querying. I'm afraid I can't even make a good guess at the read volume right now. The scary part for us is that our volume has been doubling each year so far.

Scary indeed, but good to know ;)


I know CouchDB has been built for scalability, but what sorts of use cases have you guys tried out, especially with regard to large data sets? What other items do you think might affect the system I generally described above?

[...]

The core feature for scaling CouchDB is replication.
That is an asynchronous, N-directional, rsync-like
operation that brings N nodes to the same level
data-wise.

This allows you to build any combination of
master-slave(-slave-slave…), master-master
and N-master replication topologies. The latter
is effectively distributed peer-to-peer replication
that even works when nodes are only occasionally
online.

CouchDB uses an HTTP API, which allows you to put
all the usual suspects (proxies, load balancers etc.)
in front of it.
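
To make that concrete, here's a minimal Python 3
sketch of talking to that HTTP API directly. The
host, port and the "workflows" database name are
just made up for the example, and it assumes the
database already exists (a plain PUT /workflows
creates it):

    import json
    import urllib.request

    BASE = "http://localhost:5984"  # CouchDB's default port

    def put_doc(db, doc_id, doc):
        """Create or update a document with an HTTP PUT."""
        req = urllib.request.Request(
            f"{BASE}/{db}/{doc_id}",
            data=json.dumps(doc).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def get_doc(db, doc_id):
        """Fetch a document with an HTTP GET."""
        with urllib.request.urlopen(f"{BASE}/{db}/{doc_id}") as resp:
            return json.load(resp)

    print(put_doc("workflows", "wf-0001", {"type": "workflow", "state": "open"}))
    print(get_doc("workflows", "wf-0001"))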

Since multi-master replication only helps you scale
writes and master-slave replication only helps you
scale reads, there's nothing yet that helps with
large data sets. Except for large drives in each
node :)

What we (Damien) have in mind for post-1.0 is
built-in database partitioning/sharding that would
split up a DB onto multiple nodes and handle
requests automagically for you. For now, you would
have to build such a system with an intermediate
HTTP proxy service that decides which
documents go where and vice versa (that is,
it would have to do the hashing and distribution).
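
Here's a rough Python sketch of the kind of
hashing/distribution such a proxy would do. The
node URLs are placeholders, and a real proxy would
also have to forward the actual HTTP requests and
deal with views, which this toy router doesn't:

    import hashlib

    NODES = [
        "http://couch1.example.com:5984/workflows",
        "http://couch2.example.com:5984/workflows",
        "http://couch3.example.com:5984/workflows",
    ]

    def node_for(doc_id):
        """Map a document id to one node by hashing the id."""
        digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    print(node_for("wf-0001"))  # the same id always routes to the same node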

That said, a few million workflows, both live and
archived, should not be a problem for a single
node as far as I can see. You might want to
slave-out reads if more are coming in than
your disk I/O can handle. CouchDB should not
be the bottleneck here. Again, a reverse proxy
in front of CouchDB should help a great deal.
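
Just to illustrate the idea, read/write splitting
in application code could look like the sketch
below. The URLs are placeholders; in practice a
reverse proxy like nginx or HAProxy would do this
for you:

    import itertools

    MASTER = "http://master.example.com:5984/workflows"
    SLAVES = itertools.cycle([
        "http://slave1.example.com:5984/workflows",
        "http://slave2.example.com:5984/workflows",
    ])

    def base_url(is_write):
        """Send writes to the master, round-robin reads over the slaves."""
        return MASTER if is_write else next(SLAVES)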

CouchDB is designed to handle a lot (as in
A LOT) of concurrent requests instead of
optimising for single-query speed. Single queries
are still reasonably fast, but an RDBMS is surely
faster. We can, however, easily handle 10x or more
concurrent requests compared to the average RDBMS
(for reads, that is).

A little number bragging: the Mac Mini I use
as a workstation does at most around 100 random
reads per second with MySQL. With CouchDB I can
get to ~1,000.

That is with an unoptimised CouchDB. CouchDB
does not yet perform any sort of caching and
we haven't done any profiling, so things will speed
up significantly before 1.0.


How do you envision data distributed over the world (both reading and updating)?

Have at least a single master server (and a backup)
in each location. This makes each location independent
of all the others in case the Internet goes away.

Additionally, you can have these master servers
replicate with each other. That way, your Japan office
can operate on the US office's data without round-
tripping to the US.
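
Kicking off such a replication is just another HTTP
call. Here's a hedged Python sketch using the
_replicate endpoint; the hostnames and database
name are made up, and the exact endpoint may differ
depending on your version:

    import json
    import urllib.request

    def replicate(source, target):
        """Ask the local CouchDB to replicate changes from source to target."""
        req = urllib.request.Request(
            "http://localhost:5984/_replicate",
            data=json.dumps({"source": source, "target": target}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Push local changes to the US office, then pull theirs back.
    replicate("workflows", "http://us-office.example.com:5984/workflows")
    replicate("http://us-office.example.com:5984/workflows", "workflows")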


If replication, how fast would that replication be (we have fall over support structures where people in different countries can help teams in other countries)?

Replication works on the document level at
the moment (attribute-level replication may come
in the future): only the documents that were added,
changed or deleted get replicated. That is, only
diffs are exchanged, and eventually all participating
DBs end up with the same data. How fast that is
depends on the amount of data, the number of changes
and your connection speed. At the moment operations
are not batched; single documents are fetched and
applied one by one, which slows us down a bit. We
have plans to further optimise replication by fixing
this and other things.


What sort of support structure do you envision for CouchDB?

What do you mean here? Support for the
'product' CouchDB? This is open source, so
you are on your own :) We have a lively, small,
yet growing community that helps each
other out on IRC and our mailing lists. Apart from
that, I'm a freelancer and available for money
:) There might be others, too.


You mentioned hitting beta this summer, is that still on track? Do you expect to hit 1.0 by the end of the year, or something recommended for production systems? I know these time lines are impossible to really know, but I have to ask. =)

We are still on track with this, yeah, but I won't
promise anything :) One thing to point out, though,
is that we carry the alpha label because we don't
yet have all the features we'd like to see in 1.0.
Those that are in are relatively stable, and we haven't
gotten many serious problem reports.


Will authentication and permission schemes be part of 1.0?

We will have validation, that is, a mechanism that can
deny a write on terms you define. That is not exactly
authentication, and any permission system would
have to be built on top of it. Authentication itself
will come post-1.0; until then we suggest using an
HTTP proxy to solve that for you.

[...]

Cheers
Jan
--
