Re: Attachment Replication Problem - Bug Found

Antony Blakey Sun, 17 May 2009 17:45:46 -0700


On 17/05/2009, at 9:27 PM, Adam Kocoloski wrote:

On May 16, 2009, at 8:30 PM, Antony Blakey wrote:
On 17/05/2009, at 12:09 AM, Adam Kocoloski wrote:
So, I think there's still some confusion here. By "open connections" do you mean TCP connections to the source? That number is never higher than 10. ibrowse does pipeline requests on those 10 connections, so there could be as many as 1000 simultaneous HTTP requests. However, those requests complete as soon as the data reaches the ibrowse client process, so in fact the number of outstanding requests during replication is usually very small. We're not doing flow control at the TCP socket layer.
OK, I understand that now. That means that a document with > 1000 attachments can't be replicated because ibrowse will never send ibrowse_async_headers for the excess attachments to attachment_loop, which needs to happen for every attachment before any of the data is read by doc_flush_binaries. Which is to say that every document attachment needs to start e.g. receive headers, before any attachment bodies are consumed.
Not quite. So, this discussion is going to quickly become even more confusing because as of yesterday attachments are downloaded on dedicated connections outside the load-balanced connection pool. For the sake of argument let's stick with the behavior as of 2 days ago at first.
I keep coming back to this key point: _ibrowse has no flow control_. It doesn't matter whether we consume the ibrowse_async_headers message in the attachment receiver or not; ibrowse is still going to immediately send all those ibrowse_async_response messages our way.

Sure, my point was that once the queue is full it won't send the ibrowse_async_headers (because it will never start the connection). I didn't realise that it would fail before that (as you explain below). I was assuming it would just block. Hence all my previous comments.

Now, your point about limits on the number of attachments in a document is a good one. What I imagine would happen is the following:
1) couch_rep spawns off 1000+ attachment requests to ibrowse for a single document
2) ibrowse starts sending back {error, retry_later} responses when the queue is full
3) the attachment receiver processes start exiting with attachment_request_failed
4) couch_rep traps the exits and reboots the document enumerator starting at current_seq
5) repeat
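The five steps above can be sketched as a toy simulation. This is only an illustration of the loop as described, not the couch_rep or ibrowse code: the Pool class and replicate_doc function are hypothetical names, the pipeline depth of 100 is an assumption chosen so that 10 connections give the 1000 requests mentioned earlier, and the model deliberately ignores requests draining from the queue while it fills.

```python
from dataclasses import dataclass
from typing import Tuple

MAX_CONNECTIONS = 10   # from the thread: never more than 10 TCP connections
MAX_PIPELINE = 100     # assumed, so that 10 x 100 gives the 1000 quoted above


@dataclass
class Pool:
    """Toy stand-in for the ibrowse request queue (not the ibrowse API)."""
    capacity: int = MAX_CONNECTIONS * MAX_PIPELINE
    in_flight: int = 0

    def send_req(self) -> str:
        # A full queue rejects immediately instead of blocking the caller.
        if self.in_flight >= self.capacity:
            return "retry_later"
        self.in_flight += 1
        return "ok"


def replicate_doc(attachment_count: int, max_attempts: int = 3) -> Tuple[bool, int]:
    """Return (succeeded, attempts_used) for a single document."""
    for attempt in range(1, max_attempts + 1):
        pool = Pool()  # the enumerator restarts against an empty queue
        # Step 1: every attachment request is issued up front.
        results = [pool.send_req() for _ in range(attachment_count)]
        # Steps 2-4: one retry_later kills a receiver and forces a restart.
        if "retry_later" not in results:
            return True, attempt
    # Step 5: each retry hits the same limit. The loop described above has
    # no cap; max_attempts exists only so this sketch terminates.
    return False, max_attempts


print(replicate_doc(1000))  # fits within the queue
print(replicate_doc(1001))  # one attachment too many, never succeeds
```

Under these assumptions a document at the limit replicates on the first attempt, while one with a single extra attachment fails identically on every restart, because nothing about the retry changes the number of requests issued.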
Obviously this is not a good situation. Now, I mentioned earlier that as of yesterday the attachment downloads are each done on dedicated connections. I pulled them out of the connection pool so that a document download didn't get stuck behind a giant attachment download (the end result would be one way to make couch run out of memory). This means that the max_connections x max_pipeline doesn't apply to attachments. Of course, using dedicated connections has its own scalability problems. Setting up and tearing down all of those connections for the "lots of small attachments" case introduces a significant cost, and eventually we could have so many connections in TIME_WAIT that we run out of ephemeral ports.
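To put a rough number on the ephemeral-port concern, here is a back-of-the-envelope calculation. The port range and TIME_WAIT duration are assumptions (common Linux defaults), not figures from this thread, and the function name is hypothetical. It also assumes every connection goes to the same source host and port, and that the replicator is the side left holding the socket in TIME_WAIT.

```python
from typing import Tuple


def max_sustained_connection_rate(port_range: Tuple[int, int],
                                  time_wait_seconds: float) -> float:
    """Connections per second one client can open to a single host:port
    before every ephemeral port is parked in TIME_WAIT."""
    low, high = port_range
    usable_ports = high - low + 1
    # Each closed connection holds its port for the full TIME_WAIT period.
    return usable_ports / time_wait_seconds


# Assumed values: ephemeral range 32768-60999 and a 60 second TIME_WAIT.
rate = max_sustained_connection_rate((32768, 60999), 60.0)
print(f"about {rate:.0f} short-lived connections per second")
```

With these assumed values the ceiling is a few hundred connections per second, which a replication of many small attachments over a fast link could plausibly reach.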

That new scalability problem is what I thought the original problem was with ibrowse before I learnt it had a pool.

A better solution might be to have a separate load-balanced connection pool just for attachments. We'd have to exercise some care not to retry attachment requests on a connection that already has requests in the pipeline.
In my case, I have some large attachments and unreliable links, so I'm partial to a solution that allows progress even of partial attachments during link failure. We could get this by not delaying the attachments, and buffering them to disk, using range requests on the get for partial downloads. This would solve some problems because it starts with the requirement to always make progress, never redoing work. This seems like it could be done reasonably transparently just by modifying the attachment download code.
I definitely like the idea of Range support for making progress in the event of link failure. In theory, it would be possible to build this into ibrowse so we could transparently use it for very large documents as well.
I'm not absolutely opposed to saving attachments to temporary files on disk, but I'd prefer to exhaust in-memory options first.
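The "buffer to disk and resume with Range requests" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the couch_rep download code: download_with_resume, range_header, LinkDown and make_flaky_fetch are all hypothetical names, and the transport is a fake that honours a Range header and drops the link after a fixed number of chunks so the example runs without a network.

```python
import os
import tempfile


class LinkDown(Exception):
    """Raised by the fake transport to simulate a dropped connection."""


def range_header(offset: int) -> dict:
    # "bytes=N-" asks the server for everything from byte N onward.
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


def download_with_resume(fetch, path: str, max_attempts: int = 10) -> int:
    """Append to `path` until fetch completes; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        # Whatever is already on disk is the resume point.
        offset = os.path.getsize(path) if os.path.exists(path) else 0
        try:
            with open(path, "ab") as out:
                for chunk in fetch(range_header(offset)):
                    out.write(chunk)
            return attempt
        except LinkDown:
            continue  # bytes already written are kept for the next attempt
    raise RuntimeError("gave up after %d attempts" % max_attempts)


def make_flaky_fetch(body: bytes, chunk_size: int, fail_after_chunks: int):
    """Hypothetical transport: honours Range, drops the link periodically."""
    def fetch(headers: dict):
        start = int(headers.get("Range", "bytes=0-")[len("bytes="):-1])
        sent = 0
        for pos in range(start, len(body), chunk_size):
            if sent == fail_after_chunks:
                raise LinkDown()
            yield body[pos:pos + chunk_size]
            sent += 1
    return fetch


body = bytes(range(256)) * 40  # 10240 bytes of sample attachment data
path = os.path.join(tempfile.mkdtemp(), "attachment.part")
attempts = download_with_resume(make_flaky_fetch(body, 1024, 3), path)
with open(path, "rb") as f:
    assert f.read() == body
print(f"complete after {attempts} attempts")
```

Each attempt only requests the bytes that are still missing, so the link failures cost reconnects but no repeated transfer. A real implementation would also need to confirm that the attachment has not changed between attempts (for example with an If-Range validator) and to check that the server actually answered with a partial response.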

I'm pretty sure that the only scalable solution that will handle documents with significant numbers of attachments is to avoid having all the attachments be in-progress downloading before the document is written e.g. either buffering to disk or a more radical mod of allowing attachments to be written before the document, which I guess is not going to happen. And once you allow buffering to disk as a last resort, you may as well use it as the default mechanism. Apart from anything else, it's a good basis for partial attachment download restart.

I'm wondering if it's worth exhausting in-memory options if disk buffering is absolutely required for at least one use case?

The problem I see with building it into ibrowse is the requirement to inject the restart/file management/expiration policies into ibrowse.


Cheers,

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

In anything at all, perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away.

  -- Antoine de Saint-Exupery
