Re: replication using _changes API

Adam Kocoloski Fri, 12 Jun 2009 15:38:36 -0700

On Jun 12, 2009, at 5:56 PM, Chris Anderson wrote:

On Fri, Jun 12, 2009 at 8:06 AM, Adam Kocoloski <[email protected]> wrote:
On Jun 12, 2009, at 10:59 AM, Paul Davis wrote:
On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz <[email protected]> wrote:
On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:
Hi Damien, I'm not sure I follow.  My worry was that, if I built a
replicator which only queried _changes to get the list of updates, I'd
have to be prepared to process a very large response. I thought one
smart way to process this response was to throttle the download at the
TCP level by putting the socket into passive mode.

You will have a very large response, but you can stream it, processing
one line at a time, then you discard the line and process the next. As
long as the writer is using a blocking socket and the reader is only
reading as much data as necessary to process a line, you never need to
store much of the data in memory on either side. But it seems the HTTP
client is buffering the data as it comes in, perhaps unintentionally.
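
The line-at-a-time pattern described above can be sketched in Python
roughly as follows. This is only an illustration, not CouchDB or
ibrowse code: the one-JSON-object-per-line feed, the helper names
send_changes and process_stream, and the use of socket.socketpair() in
place of a real HTTP connection are all assumptions made for the
example. Note that makefile() does keep a small read buffer (a few KB),
so "only as much as necessary" is approximate here.

```python
import json
import socket
import threading

def send_changes(sock, n):
    """Writer: emit n newline-terminated JSON rows on a blocking socket."""
    with sock:
        for seq in range(1, n + 1):
            row = {"seq": seq, "id": f"doc{seq}"}  # hypothetical row shape
            sock.sendall(json.dumps(row).encode() + b"\n")

def process_stream(sock):
    """Reader: handle one line at a time, keeping only a running summary."""
    count, last_seq = 0, None
    with sock, sock.makefile("rb") as lines:
        for line in lines:            # each line is discarded after use
            row = json.loads(line)
            count, last_seq = count + 1, row["seq"]
    return count, last_seq

reader, writer = socket.socketpair()
t = threading.Thread(target=send_changes, args=(writer, 50_000))
t.start()
result = process_stream(reader)
t.join()
print(result)  # (50000, 50000)
```

The reader never holds more than one parsed row plus the small
file-object buffer, however many rows the writer sends.
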
With TCP, the sending side will only send so much data before getting
an ACK, acknowledgment that packets sent were actually received. When
an ACK isn't received, the sender stops sending, and the TCP calls will
block at the sender (or return an error if the socket is in
non-blocking mode), until it gets a response or socket timeout.
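
To see that back-pressure in action, here is a small Python sketch. It
is an illustration under assumptions, not a model of CouchDB's
replicator: socket.socketpair() stands in for a TCP connection (on most
Unix systems it is a Unix-domain pair, whose send buffer fills up in
the same way), and fill_until_blocked and drain are hypothetical helper
names. The sender is put in non-blocking mode so that "buffers full"
shows up as an error instead of a hang.

```python
import socket

def fill_until_blocked(sender, chunk=b"x" * 4096, limit=64 * 1024 * 1024):
    """Send on a non-blocking socket until the kernel buffers are full."""
    sender.setblocking(False)
    sent = 0
    while sent < limit:
        try:
            sent += sender.send(chunk)
        except BlockingIOError:
            break  # a blocking socket would simply wait here
    return sent

def drain(receiver, nbytes):
    """Read nbytes so that the sender's buffers empty out again."""
    remaining = nbytes
    while remaining:
        data = receiver.recv(min(65536, remaining))
        if not data:
            break
        remaining -= len(data)
    return nbytes - remaining

reader, writer = socket.socketpair()
buffered = fill_until_blocked(writer)  # the reader has read nothing yet
drained = drain(reader, buffered)
resumed = writer.send(b"more")         # accepted again after the drain
print(buffered, drained, resumed)
reader.close()
writer.close()
```

The amount the sender manages to queue before stalling depends on the
operating system's buffer sizes, but it is bounded, and sending only
resumes once the reader has consumed data.
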
So if you have a non-buffering reader and a blocking sender, then you
can stream the data and only relatively small amounts of data are
buffered at any time. The problem is the reader in the HTTP client
isn't waiting for the data to be demanded at all; instead, as soon as
data comes in, it sends it to a receiving Erlang process. Erlang
processes never block to receive messages, so there is no limit to the
amount of data buffered. So if the Erlang process can't process the
data fast enough, it starts getting buffered in its mailbox, consuming
unlimited memory.

Assuming I understand the problem correctly, the way to fix it is to
have the HTTP client not read the data until it's demanded by the
consuming process. Then we are only using the default TCP buffers, not
the Erlang message queues as a buffer, and the total amount of memory
used at any time is small.
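
The difference between the two designs can be modelled in a few lines
of Python. This is a deliberately simplified sketch, not Erlang and not
ibrowse: a deque plays the role of the unbounded mailbox, push_mode,
pull_mode and feed are hypothetical names, and push_mode assumes the
worst case in which the consumer handles nothing until the reader has
forwarded every line.

```python
from collections import deque

def push_mode(lines, handle):
    """Reader forwards every line at once; the mailbox is unbounded."""
    mailbox, peak = deque(), 0
    for line in lines:
        mailbox.append(line)
        peak = max(peak, len(mailbox))
    while mailbox:
        handle(mailbox.popleft())
    return peak  # largest number of lines held in memory at once

def pull_mode(lines, handle):
    """Consumer asks for the next line only after finishing the last."""
    peak = 0
    for line in lines:
        peak = max(peak, 1)  # at most one line is held at a time
        handle(line)
    return peak

def feed(n):
    """Lazily generate n hypothetical change rows."""
    return (f'{{"seq":{i}}}' for i in range(n))

seen = []
print(push_mode(feed(10_000), seen.append))  # 10000
print(pull_mode(feed(10_000), seen.append))  # 1
```

Both modes process the same rows; only the peak number held in memory
differs, which is the point of reading on demand.
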
Dunno about HTTP clients, but when I was playing around with gen_tcp a
week or two ago I found a parameter to opening a socket that is
something like {active, false} that affects this specific
functionality. Active sockets send tcp data as Erlang messages,
inactive sockets don't and you have to get the data with
gen_tcp:recv(Sock).

I haven't the foggiest if the HTTP bits expose any of that though.

As far as I can tell, the {stream,{self,once}} translates to an
inet:setopts(socket(), [{active,once}]), which accomplishes the same
basic goal as {active,false}, just with repeated calls to
setopts(Sock,[{active,once}]) instead of gen_tcp:recv(Sock). I must be
missing something, though, because clearly I'm getting more messages
than I asked for.
I'm sure I could cook up something simple using gen_tcp directly, but
even I'll have to deal with authentication, ssl, etc. so I'd prefer to
use a full-fledged HTTP client if I can get it to work.  Best,

Oscar from Erlang Training and Consulting has just open-sourced one of
their HTTP clients, which may be a better fit than ibrowse as it seems
to be a much thinner layer. There is some discussion of it on the
Erlang Questions list, but the most useful link is probably to the
source code:

http://bitbucket.org/etc/lhttpc

This does not support streaming the response body yet, but Oscar's
told me that it shouldn't be hard to add. So this may be just the
thing for getting a raw connection to the socket, without having to
worry about auth, ssl, etc.

Hi Chris, I saw Oscar's announcement on erlang-questions and checked
out the code. It definitely won't work for us without the ability to
read a chunked response, but if he adds that and the streaming option
I'm sure it'll be worth a closer look.

In my opinion it's unfortunate that we have a proliferation of HTTP
clients instead of one really solid implementation. I'm sure ETC had
its reasons for starting from scratch instead of contributing to
ibrowse or inets. Oscar certainly did a great job of describing the
limitations of both in these two messages:


http://groups.google.com/group/erlang-programming/browse_thread/thread/bc5db72fbe2ac9c7
http://groups.google.com/group/erlang-programming/browse_thread/thread/a896b641348a50ca

Cheers, Adam
