On Jun 12, 2009, at 10:59 AM, Paul Davis wrote:
On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz <[email protected]> wrote:
On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:
Hi Damien, I'm not sure I follow. My worry was that, if I built a replicator which only queried _changes to get the list of updates, I'd have to be prepared to process a very large response. I thought one smart way to process this response was to throttle the download at the TCP level by putting the socket into passive mode.
You will have a very large response, but you can stream it, processing one line at a time; then you discard the line and process the next. As long as the writer is using a blocking socket and the reader is only reading as much data as necessary to process a line, you never need to store much of the data in memory on either side. But it seems the HTTP client is buffering the data as it comes in, perhaps unintentionally.
With TCP, the sending side will only send so much data before getting an ACK, an acknowledgment that the packets sent were actually received. When an ACK isn't received, the sender stops sending, and the TCP calls block at the sender (or return an error if the socket is in non-blocking mode) until it gets a response or the socket times out.
So if you have a non-buffering reader and a blocking sender, then you can stream the data and only relatively small amounts of data are buffered at any time. The problem is that the reader in the HTTP client isn't waiting for the data to be demanded at all; instead, as soon as data comes in, it sends it to a receiving Erlang process. Erlang processes never block to receive messages, so there is no limit to the amount of data buffered. So if the Erlang process can't process the data fast enough, it starts getting buffered in its mailbox, consuming unlimited memory.
Assuming I understand the problem correctly, the way to fix it is to have the HTTP client not read the data until it's demanded by the consuming process. Then we are only using the default TCP buffers, not the Erlang message queues, as a buffer, and the total amount of memory used at any time is small.
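The non-buffering reader Damien describes can be sketched directly with gen_tcp. This is a hypothetical, simplified example (module and function names are made up, HTTP parsing is ignored): a passive socket with {packet, line} lets us pull exactly one line per recv call, so a slow reader throttles the sender through TCP's own window.

```erlang
%% Sketch only: stream a line-oriented response over a passive socket
%% so the sender blocks when the reader falls behind.
-module(line_stream).
-export([stream/2]).

stream(Host, Port) ->
    %% {active, false}: no data arrives as messages; we must ask for it.
    %% {packet, line}: recv/2 returns exactly one newline-terminated line.
    {ok, Sock} = gen_tcp:connect(Host, Port,
                                 [binary, {active, false}, {packet, line}]),
    ok = gen_tcp:send(Sock, <<"GET /db/_changes HTTP/1.0\r\n\r\n">>),
    loop(Sock).

loop(Sock) ->
    case gen_tcp:recv(Sock, 0) of
        {ok, Line} ->
            %% Process one line, then ask for the next.  While we're busy
            %% here, unacknowledged data backs up in the kernel buffers
            %% and TCP flow control stalls the sender.
            handle_line(Line),
            loop(Sock);
        {error, closed} ->
            gen_tcp:close(Sock),
            ok
    end.

handle_line(Line) ->
    io:format("got ~p bytes~n", [byte_size(Line)]).
```

Nothing Erlang-side buffers more than one line at a time; the only buffering is the kernel's TCP window on each end.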
Dunno about HTTP clients, but when I was playing around with gen_tcp a week or two ago I found a socket option, {active, false}, that affects this specific functionality. Active sockets deliver TCP data as Erlang messages; passive sockets don't, and you have to get the data with gen_tcp:recv(Sock). I haven't the foggiest if the HTTP bits expose any of that, though.
As far as I can tell, {stream, {self, once}} translates to inet:setopts(Socket, [{active, once}]), which accomplishes the same basic goal as {active, false}, just with repeated calls to setopts(Sock, [{active, once}]) instead of gen_tcp:recv(Sock). I must be missing something, though, because clearly I'm getting more messages than I asked for.
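For comparison, the {active, once} pattern Adam is describing looks roughly like this (a sketch, assuming Sock is an already-connected socket opened with {active, once}; handle/1 is a hypothetical callback):

```erlang
%% At most one {tcp, Sock, Data} message is delivered per setopts call,
%% so the process mailbox can't grow without bound.
recv_once(Sock) ->
    receive
        {tcp, Sock, Data} ->
            handle(Data),
            %% Re-arm the socket for exactly one more message.  Until we
            %% do this, no further data is delivered, and TCP flow
            %% control eventually stalls the sender.
            ok = inet:setopts(Sock, [{active, once}]),
            recv_once(Sock);
        {tcp_closed, Sock} ->
            ok
    end.
```

If an HTTP client using this pattern still floods the consumer, the likely explanation is that the client process itself re-arms the socket eagerly and forwards the data on, rather than waiting for the consumer to demand it.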
I'm sure I could cook up something simple using gen_tcp directly, but even then I'll have to deal with authentication, SSL, etc., so I'd prefer to use a full-fledged HTTP client if I can get it to work.

Best, Adam