On Jun 12, 2009, at 10:59 AM, Paul Davis wrote:

On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz<[email protected]> wrote:


On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:

Hi Damien, I'm not sure I follow.  My worry was that, if I built a
replicator which only queried _changes to get the list of updates, I'd have to be prepared to process a very large response. I thought one smart way to process this response was to throttle the download at the TCP level by
putting the socket into passive mode.

You will have a very large response, but you can stream it, processing one line at a time, then you discard the line and process the next. As long as the writer is using a blocking socket and the reader is only reading as much data as necessary to process a line, you never need to store much of the data in memory on either side. But it seems the HTTP client is buffering the
data as it comes in, perhaps unintentionally.
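
A minimal sketch of the line-at-a-time reading described above, assuming a passive-mode gen_tcp socket with {packet, line}. The module name, the three-line payload, and the local test server are hypothetical stand-ins for a real _changes feed; this is not how any particular HTTP client is implemented.

```erlang
-module(line_stream).
-export([demo/0]).

%% Hypothetical demo: a local server writes three lines and closes; the
%% client pulls them one at a time from a passive socket.
demo() ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {packet, line}, {active, false}]),
    {ok, Port} = inet:port(Listen),
    spawn_link(fun() ->
        {ok, S} = gen_tcp:accept(Listen),
        [ok = gen_tcp:send(S, L) || L <- [<<"a\n">>, <<"b\n">>, <<"c\n">>]],
        gen_tcp:close(S)
    end),
    {ok, Sock} = gen_tcp:connect({127,0,0,1}, Port,
                                 [binary, {packet, line}, {active, false}]),
    Count = read_lines(Sock, 0),
    gen_tcp:close(Listen),
    Count.

%% Each recv/2 call returns at most one line. Nothing is read from the
%% socket until the consumer asks for it, so unread data stays in the
%% kernel's TCP buffers rather than in Erlang memory.
read_lines(Sock, N) ->
    case gen_tcp:recv(Sock, 0) of
        {ok, _Line} ->
            %% Process the line here, then discard it.
            read_lines(Sock, N + 1);
        {error, closed} ->
            gen_tcp:close(Sock),
            N
    end.
```

Calling line_stream:demo() should return the number of lines read. Only one line is held by the reader at any moment.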

With TCP, the sending side will only send so much data before getting an ACK, acknowledgment that packets sent were actually received. When an ACK isn't received, the sender stops sending, and the TCP calls will block at the sender (or return an error if the socket is in non-blocking mode), until
it gets a response or socket timeout.

So if you have a non-buffering reader and a blocking sender, then you can stream the data and only relatively small amounts of data are buffered at any time. The problem is the reader in the HTTP client isn't waiting for the data to be demanded at all; instead, as soon as data comes in, it sends it to
a receiving Erlang process. Sending a message to an Erlang process never
blocks the sender, so there is no limit to the amount of data buffered. So if
the Erlang process can't process the data fast enough, it starts getting
buffered in its mailbox, consuming unlimited memory.
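
The mailbox growth described above can be observed directly. This is only a sketch: the module name, the {chunk, N} message shape, and the count of 10000 are made up for illustration, with a deliberately idle receiver standing in for a slow consumer.

```erlang
-module(mailbox_demo).
-export([demo/0]).

%% Hypothetical demo: flood a process that is not consuming its messages
%% and report how many are sitting in its mailbox.
demo() ->
    %% The receiver only matches 'stop', so every {chunk, N} message
    %% stays queued in its mailbox.
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% Sends never block, no matter how far behind the receiver is.
    [Pid ! {chunk, N} || N <- lists:seq(1, 10000)],
    {message_queue_len, Len} = erlang:process_info(Pid, message_queue_len),
    Pid ! stop,
    Len.
```

The returned queue length shows how many messages piled up while the sender kept going unimpeded, which is the unbounded buffering in question.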

Assuming I understand the problem correctly, the way to fix it is to have the HTTP client not read the data until it's demanded by the consuming process. Then we are only using the default TCP buffers, not the Erlang message queues as a buffer, and the total amount of memory used at any time
is small.


Dunno about HTTP clients, but when I was playing around with gen_tcp a
week or two ago I found a parameter to opening a socket that is
something like {active, false} that affects this specific
functionality. Active sockets send TCP data as Erlang messages,
passive sockets don't and you have to get the data with
gen_tcp:recv(Sock, Length).

I haven't the foggiest if the HTTP bits expose any of that though.

As far as I can tell, the {stream,{self,once}} translates to an inet:setopts(socket(), [{active,once}]), which accomplishes the same basic goal as {active,false}, just with repeated calls to setopts(Sock, [{active,once}]) instead of gen_tcp:recv(Sock, Length). I must be missing something, though, because clearly I'm getting more messages than I asked for.
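
For comparison, here is a rough sketch of the {active, once} pattern on a bare gen_tcp socket, where the consumer re-arms the socket before each message. The module name and the local test server are hypothetical, and this does not show what the HTTP client does internally.

```erlang
-module(active_once).
-export([demo/0]).

%% Hypothetical demo: a local server writes three lines and closes; the
%% client receives them as messages, one per setopts call.
demo() ->
    {ok, Listen} = gen_tcp:listen(0, [binary, {packet, line}, {active, false}]),
    {ok, Port} = inet:port(Listen),
    spawn_link(fun() ->
        {ok, S} = gen_tcp:accept(Listen),
        [ok = gen_tcp:send(S, L) || L <- [<<"a\n">>, <<"b\n">>, <<"c\n">>]],
        gen_tcp:close(S)
    end),
    {ok, Sock} = gen_tcp:connect({127,0,0,1}, Port,
                                 [binary, {packet, line}, {active, false}]),
    Result = loop(Sock, 0),
    gen_tcp:close(Listen),
    Result.

loop(Sock, N) ->
    %% Ask for exactly one more message; the socket drops back to
    %% passive mode as soon as that message has been delivered.
    ok = inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, _Line} ->
            loop(Sock, N + 1);
        {tcp_closed, Sock} ->
            N;
        {tcp_error, Sock, Reason} ->
            {error, Reason}
    end.
```

With this loop at most one data message per setopts call reaches the mailbox, so a consumer that sees more messages than it asked for is presumably being fed by a layer above the socket.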

I'm sure I could cook up something simple using gen_tcp directly, but even then I'll have to deal with authentication, SSL, etc., so I'd prefer to use a full-fledged HTTP client if I can get it to work.

Best,

Adam
