On May 16, 2009, at 8:30 PM, Antony Blakey wrote:

On 17/05/2009, at 12:09 AM, Adam Kocoloski wrote:

So, I think there's still some confusion here. By "open connections" do you mean TCP connections to the source? That number is never higher than 10. ibrowse does pipeline requests on those 10 connections, so there could be as many as 1000 simultaneous HTTP requests. However, those requests complete as soon as the data reaches the ibrowse client process, so in fact the number of outstanding requests during replication is usually very small. We're not doing flow control at the TCP socket layer.

OK, I understand that now. That means that a document with > 1000 attachments can't be replicated, because ibrowse will never send ibrowse_async_headers for the excess attachments to attachment_loop, which needs to happen for every attachment before any of the data is read by doc_flush_binaries. Which is to say that every document attachment needs to start, i.e. receive headers, before any attachment bodies are consumed.

Not quite. So, this discussion is going to quickly become even more confusing because as of yesterday attachments are downloaded on dedicated connections outside the load-balanced connection pool. For the sake of argument let's stick with the behavior as of 2 days ago at first.

I keep coming back to this key point: _ibrowse has no flow control_. It doesn't matter whether we consume the ibrowse_async_headers message in the attachment receiver or not; ibrowse is still going to immediately send all those ibrowse_async_response messages our way.

Now, your point about limits on the number of attachments in a document is a good one. What I imagine would happen is the following:

1) couch_rep spawns off 1000+ attachment requests to ibrowse for a single document
2) ibrowse starts sending back {error, retry_later} responses when the queue is full
3) the attachment receiver processes start exiting with attachment_request_failed
4) couch_rep traps the exits and reboots the document enumerator starting at current_seq
5) repeat
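
The loop above can be illustrated with a small simulation. This is only a Python sketch, not couch_rep or ibrowse code: the RequestQueue class and replicate_document function are hypothetical, the capacity of 1000 is the "10 connections, as many as 1000 simultaneous requests" figure from earlier in the thread, and for simplicity it assumes that no request completes while the others are still being issued.

```python
class RequestQueue:
    """Hypothetical bounded request queue (a stand-in, not the ibrowse API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = 0

    def submit(self):
        # Mirror the {error, retry_later} behaviour: refuse once full.
        if self.pending >= self.capacity:
            return "retry_later"
        self.pending += 1
        return "ok"

    def reset(self):
        self.pending = 0


def replicate_document(queue, attachment_count, max_attempts=3):
    """Return the attempt number that succeeded, or None if all failed.

    Assumption: every attachment request is issued up front and none
    completes in the meantime, so each attempt sees the same full queue.
    """
    for attempt in range(1, max_attempts + 1):
        queue.reset()  # the enumerator restarts at current_seq
        results = [queue.submit() for _ in range(attachment_count)]
        if "retry_later" not in results:
            return attempt
    return None


queue = RequestQueue(capacity=1000)
print(replicate_document(queue, 1000))  # prints 1
print(replicate_document(queue, 1001))  # prints None: it can never fit
```

Under these assumptions, retrying changes nothing: the oversized document fails identically on every pass, which is why step 5 never terminates.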

Obviously this is not a good situation. Now, I mentioned earlier that as of yesterday the attachment downloads are each done on dedicated connections. I pulled them out of the connection pool so that a document download wouldn't get stuck behind a giant attachment download (which would be one way to make couch run out of memory). This means that the max_connections x max_pipeline limit doesn't apply to attachments. Of course, using dedicated connections has its own scalability problems. Setting up and tearing down all of those connections for the "lots of small attachments" case introduces a significant cost, and eventually we could have so many connections in TIME_WAIT that we run out of ephemeral ports.

A better solution might be to have a separate load-balanced connection pool just for attachments. We'd have to exercise some care not to retry attachment requests on a connection that already has requests in the pipeline.

In my case, I have some large attachments and unreliable links, so I'm partial to a solution that allows progress, even on partially downloaded attachments, during link failure. We could get this by not delaying the attachments, buffering them to disk, and using Range requests on the GET to resume partial downloads. This would solve some problems because it starts with the requirement to always make progress, never redoing work. This seems like it could be done reasonably transparently just by modifying the attachment download code.

I definitely like the idea of Range support for making progress in the event of link failure. In theory, it would be possible to build this into ibrowse so we could transparently use it for very large documents as well.
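
To make the Range idea concrete, here is a minimal Python sketch of resuming a download from the last byte received. It is an illustration only, not ibrowse or couch_rep code: download_with_resume, the fetch callable, and flaky_fetch are all hypothetical names, and the "link" is simulated in memory rather than going over HTTP. The only protocol detail it relies on is the standard "bytes=N-" form of the Range header.

```python
def range_header(bytes_already_stored):
    """Build a Range header asking for everything from the given offset on."""
    return {"Range": "bytes=%d-" % bytes_already_stored}


def download_with_resume(fetch, total_size, max_attempts=10):
    """Collect total_size bytes, resuming after each interruption.

    fetch(headers) is a hypothetical callable returning whatever bytes
    arrived before the link dropped (possibly fewer than requested).
    """
    buffer = bytearray()
    for _ in range(max_attempts):
        if len(buffer) >= total_size:
            break
        # Ask only for the bytes we do not have yet; never redo work.
        buffer.extend(fetch(range_header(len(buffer))))
    return bytes(buffer)


# Simulated flaky source: each call delivers at most 4 bytes, then "fails".
DATA = b"attachment body bytes"

def flaky_fetch(headers):
    start = int(headers["Range"][len("bytes="):-1])
    return DATA[start:start + 4]

assert download_with_resume(flaky_fetch, len(DATA)) == DATA
```

The buffer here lives in memory, but the same offset bookkeeping would apply if the received bytes were written to a temporary file instead.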

I'm not absolutely opposed to saving attachments to temporary files on disk, but I'd prefer to exhaust in-memory options first.

Cheers, Adam
