On May 16, 2009, at 8:30 PM, Antony Blakey wrote:
On 17/05/2009, at 12:09 AM, Adam Kocoloski wrote:
So, I think there's still some confusion here. By "open
connections" do you mean TCP connections to the source? That
number is never higher than 10. ibrowse does pipeline requests on
those 10 connections, so there could be as many as 1000
simultaneous HTTP requests. However, those requests complete as
soon as the data reaches the ibrowse client process, so in fact the
number of outstanding requests during replication is usually very
small. We're not doing flow control at the TCP socket layer.
OK, I understand that now. That means that a document with > 1000
attachments can't be replicated because ibrowse will never send
ibrowse_async_headers for the excess attachments to attachment_loop,
which needs to happen for every attachment before any of the data is
read by doc_flush_binaries. Which is to say that every document
attachment needs to start, i.e. receive headers, before any
attachment bodies are consumed.
Not quite. So, this discussion is going to quickly become even more
confusing because as of yesterday attachments are downloaded on
dedicated connections outside the load-balanced connection pool. For
the sake of argument let's stick with the behavior as of 2 days ago at
first.
I keep coming back to this key point: _ibrowse has no flow control_.
It doesn't matter whether we consume the ibrowse_async_headers message
in the attachment receiver or not; ibrowse is still going to
immediately send all those ibrowse_async_response messages our way.
Now, your point about limits on the number of attachments in a
document is a good one. What I imagine would happen is the following:
1) couch_rep spawns off 1000+ attachment requests to ibrowse for a
single document
2) ibrowse starts sending back {error, retry_later} responses when the
queue is full
3) the attachment receiver processes start exiting with
attachment_request_failed
4) couch_rep traps the exits and reboots the document enumerator
starting at current_seq
5) repeat
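
To make the loop above concrete, here is a toy model in Python (not couch_rep or ibrowse code). It assumes all attachment requests for a document are issued before any of them completes, and it infers a pipeline depth of 100 from the "10 connections, as many as 1000 requests" figures quoted earlier. The function names and the max_restarts cap are made up for illustration.

```python
# Toy model only: names and structure are hypothetical, not the real APIs.
MAX_CONNECTIONS = 10
MAX_PIPELINE = 100  # assumption: inferred from 10 connections / 1000 requests


def spawn_attachment_requests(n_attachments, capacity):
    """Accept requests until the queue is full, then refuse the rest."""
    results = []
    in_flight = 0
    for _ in range(n_attachments):
        if in_flight < capacity:
            in_flight += 1
            results.append("ok")
        else:
            # Step 2: the queue is full, so the request is refused.
            results.append(("error", "retry_later"))
    return results


def replicate_document(n_attachments, max_restarts=5):
    """Return ("ok", attempt) on success or ("stuck", max_restarts)."""
    capacity = MAX_CONNECTIONS * MAX_PIPELINE
    for attempt in range(1, max_restarts + 1):
        results = spawn_attachment_requests(n_attachments, capacity)  # step 1
        if all(r == "ok" for r in results):
            return ("ok", attempt)
        # Steps 3-4: a receiver exits with attachment_request_failed and the
        # enumerator restarts at the same sequence, so the next attempt
        # issues exactly the same requests again (step 5).
    return ("stuck", max_restarts)
```

In this model a document with 1000 attachments goes through on the first attempt, while one with 1001 never does, no matter how many restarts are allowed, because each restart replays the same burst of requests.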
Obviously this is not a good situation. Now, I mentioned earlier that
as of yesterday the attachment downloads are each done on dedicated
connections. I pulled them out of the connection pool so that a
document download didn't get stuck behind a giant attachment download
(the end result would be one way to make couch run out of memory).
This means that the max_connections x max_pipeline limit doesn't apply to
attachments. Of course, using dedicated connections has its own
scalability problems. Setting up and tearing down all of those
connections for the "lots of small attachments" case introduces a
significant cost, and eventually we could have so many connections in
TIME_WAIT that we run out of ephemeral ports.
A better solution might be to have a separate load-balanced connection
pool just for attachments. We'd have to exercise some care not to
retry attachment requests on a connection that already has requests in
the pipeline.
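
A rough sketch of what I mean, in Python rather than Erlang and with entirely hypothetical class and method names: new attachment requests go to the least-loaded connection in a dedicated pool, while a retry is only placed on a connection whose pipeline is empty.

```python
# Hypothetical sketch of a dedicated attachment pool; not ibrowse's API.
class Connection:
    def __init__(self, name):
        self.name = name
        self.pipeline = []  # requests currently queued on this connection


class AttachmentPool:
    def __init__(self, size):
        self.connections = [Connection(f"att-{i}") for i in range(size)]

    def pick_for_new_request(self):
        # Plain load balancing: the connection with the shortest pipeline.
        return min(self.connections, key=lambda c: len(c.pipeline))

    def pick_for_retry(self):
        # A retry must not share a connection with requests already in the
        # pipeline, so only an idle connection qualifies.
        for conn in self.connections:
            if not conn.pipeline:
                return conn
        return None

    def submit(self, request, retry=False):
        """Queue the request; return the connection name, or None if a
        retry has to wait because no connection is idle."""
        conn = self.pick_for_retry() if retry else self.pick_for_new_request()
        if conn is None:
            return None
        conn.pipeline.append(request)
        return conn.name
```

With a pool of two connections, two new requests land on att-0 and att-1, and a retry submitted at that point gets None back until one of the pipelines drains.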
In my case, I have some large attachments and unreliable links, so
I'm partial to a solution that allows progress even of partial
attachments during link failure. We could get this by not delaying
the attachments, and buffering them to disk, using range requests on
the get for partial downloads. This would solve some problems
because it starts with the requirement to always make progress,
never redoing work. This seems like it could be done reasonably
transparently just by modifying the attachment download code.
I definitely like the idea of Range support for making progress in the
event of link failure. In theory, it would be possible to build this
into ibrowse so we could transparently use it for very large documents
as well.
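
As a minimal sketch of the resume logic (Python, not ibrowse): keep whatever bytes have arrived, and after a link failure ask only for the remainder with a standard "Range: bytes=N-" header. The fetch callable, make_flaky_source, and the in-memory buffer are all stand-ins invented for this example; a real server would also need to answer with 206 Partial Content.

```python
# Hypothetical sketch: fetch(headers) is assumed to yield body chunks and
# may raise ConnectionError partway through.
def range_header(bytes_already_buffered):
    """Open-ended byte range starting at the first byte we still need."""
    return {"Range": f"bytes={bytes_already_buffered}-"}


def download_with_resume(fetch, total_size, max_attempts=10):
    buffer = bytearray()
    for _ in range(max_attempts):
        if len(buffer) >= total_size:
            break
        headers = range_header(len(buffer)) if buffer else {}
        try:
            for chunk in fetch(headers):
                buffer.extend(chunk)  # chunks received before a failure stay
        except ConnectionError:
            continue  # resume from len(buffer) on the next attempt
    return bytes(buffer)


def make_flaky_source(data, chunk_size=4, failures=2):
    """Fake server: drops the link after two chunks, `failures` times."""
    remaining_failures = [failures]

    def fetch(headers):
        start = 0
        if "Range" in headers:
            # Parse "bytes=N-" into the integer N.
            start = int(headers["Range"][len("bytes="):-1])
        sent = 0
        for offset in range(start, len(data), chunk_size):
            if remaining_failures[0] > 0 and sent == 2:
                remaining_failures[0] -= 1
                raise ConnectionError("link dropped")
            yield data[offset:offset + chunk_size]
            sent += 1

    return fetch
```

Calling download_with_resume(make_flaky_source(data), len(data)) returns the full body even though the fake link drops twice, and no byte is requested a second time.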
I'm not absolutely opposed to saving attachments to temporary files on
disk, but I'd prefer to exhaust in-memory options first.
Cheers, Adam