Re: replication using _changes API

Chris Anderson Fri, 12 Jun 2009 14:56:50 -0700

On Fri, Jun 12, 2009 at 8:06 AM, Adam Kocoloski<[email protected]> wrote:
> On Jun 12, 2009, at 10:59 AM, Paul Davis wrote:
>
>> On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz<[email protected]> wrote:
>>>
>>>
>>> On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:
>>>
>>>> Hi Damien, I'm not sure I follow.  My worry was that, if I built a
>>>> replicator which only queried _changes to get the list of updates, I'd
>>>> have to be prepared to process a very large response.  I thought one
>>>> smart way to process this response was to throttle the download at the
>>>> TCP level by putting the socket into passive mode.
>>>
>>> You will have a very large response, but you can stream it, processing
>>> one line at a time, then you discard the line and process the next. As
>>> long as the writer is using a blocking socket and the reader is only
>>> reading as much data as necessary to process a line, you never need to
>>> store much of the data in memory on either side. But it seems the HTTP
>>> client is buffering the data as it comes in, perhaps unintentionally.
>>>
>>> With TCP, the sending side will only send so much data before getting an
>>> ACK, acknowledgment that packets sent were actually received. When an ACK
>>> isn't received, the sender stops sending, and the TCP calls will block at
>>> the sender (or return an error if the socket is in non-blocking mode),
>>> until it gets a response or socket timeout.
>>>
>>> So if you have a non-buffering reader and a blocking sender, then you can
>>> stream the data and only relatively small amounts of data are buffered at
>>> any time. The problem is the reader in the HTTP client isn't waiting for
>>> the data to be demanded at all, instead as soon as data comes in, it sends
>>> it to a receiving Erlang process. Erlang processes never block to receive
>>> messages, so there is no limit to the amount of data buffered. So if the
>>> Erlang process can't process the data fast enough, it starts getting
>>> buffered in its mailbox, consuming unlimited memory.
>>>
>>> Assuming I understand the problem correctly, the way to fix it is to have
>>> the HTTP client not read the data until it's demanded by the consuming
>>> process. Then we are only using the default TCP buffers, not the Erlang
>>> message queues as a buffer, and the total amount of memory used at any
>>> time is small.
>>>
>>
>> Dunno about HTTP clients, but when I was playing around with gen_tcp a
>> week or two ago I found a parameter to opening a socket that is
>> something like {active, false} that affects this specific
>> functionality. Active sockets send tcp data as Erlang messages,
>> inactive sockets don't and you have to get the data with
>> gen_tcp:recv(Sock).
>>
>> I haven't the foggiest if the HTTP bits expose any of that though.
>
> As far as I can tell, the {stream,{self,once}} translates to an
> inet:setopts(socket(), [{active,once}]), which accomplishes the same basic
> goal as {active,false}, just with repeated calls to
> setopts(Sock,[{active,once}]) instead of gen_tcp:recv(Sock).  I must be
> missing something, though, because clearly I'm getting more messages than I
> asked for.
>
> I'm sure I could cook up something simple using gen_tcp directly, but even
> then I'll have to deal with authentication, ssl, etc. so I'd prefer to use a
> full-fledged HTTP client if I can get it to work.  Best,
>


Oscar from Erlang Training and Consulting has just open-sourced one of
their HTTP clients, which may be a better fit than ibrowse as it seems
to be a much thinner layer. There is some discussion of it on the
Erlang Questions list, but the most useful link is probably to the
source code:

http://bitbucket.org/etc/lhttpc

This does not support streaming the response body yet, but Oscar's
told me that it shouldn't be hard to add. So this may be just the
thing for getting a raw connection to the socket, without having to
worry about auth, ssl, etc.
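
For reference, here is a rough sketch of the pull-style reading discussed
above, using gen_tcp directly with a passive socket. It is only an
illustration under some assumptions: the module name changes_pull and the
function fold_lines/4 are made up for this example, the caller supplies the
raw request bytes, and nothing here parses HTTP headers or chunked encoding
or handles auth or ssl. Lines longer than the socket receive buffer are not
handled specially either.

```erlang
-module(changes_pull).
-export([fold_lines/4]).

%% Hypothetical helper: connect in passive mode, send Request, then fold
%% Fun over the response one line at a time. Data is only taken off the
%% socket when gen_tcp:recv/2 is called, so a slow consumer leaves bytes
%% in the kernel TCP buffers rather than in an Erlang mailbox.
fold_lines(Host, Port, Request, {Fun, Acc0}) ->
    Opts = [binary, {packet, line}, {active, false}],
    {ok, Sock} = gen_tcp:connect(Host, Port, Opts),
    ok = gen_tcp:send(Sock, Request),
    try
        loop(Sock, Fun, Acc0)
    after
        gen_tcp:close(Sock)
    end.

%% With {packet, line} the Length argument to recv/2 must be 0; each
%% successful call returns one line, including its trailing newline.
loop(Sock, Fun, Acc) ->
    case gen_tcp:recv(Sock, 0) of
        {ok, Line}      -> loop(Sock, Fun, Fun(Line, Acc));
        {error, closed} -> Acc
    end.
```

Something like changes_pull:fold_lines(Host, Port, Req, {fun(L, N) -> N + 1
end, 0}) would count response lines while holding at most one line in the
Erlang process at a time.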

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
