On Jun 12, 2009, at 5:56 PM, Chris Anderson wrote:

On Fri, Jun 12, 2009 at 8:06 AM, Adam Kocoloski<[email protected]> wrote:
On Jun 12, 2009, at 10:59 AM, Paul Davis wrote:

On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz<[email protected]> wrote:


On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:

Hi Damien, I'm not sure I follow.  My worry was that, if I built a
replicator which only queried _changes to get the list of updates, I'd
have to be prepared to process a very large response. I thought one
smart way to process this response was to throttle the download at the
TCP level by putting the socket into passive mode.

You will have a very large response, but you can stream it, processing
one line at a time, then you discard the line and process the next. As
long as the writer is using a blocking socket and the reader is only
reading as much data as necessary to process a line, you never need to
store much of the data in memory on either side. But it seems the HTTP
client is buffering the data as it comes in, perhaps unintentionally.
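
To make the line-at-a-time idea concrete, here is a minimal sketch in Python rather than Erlang, purely as an illustration. The function name iter_lines, the bufsize parameter, and the sample records are hypothetical and not taken from any CouchDB or HTTP client code. It only pulls more bytes from the socket when the caller asks for the next line:

```python
import socket


def iter_lines(sock, bufsize=4096):
    """Yield newline-delimited records from sock, one at a time.

    Nothing is read from the socket until the caller asks for the next
    line, so only the current partial line and the most recent recv()
    chunk are held in memory.
    """
    pending = b""
    while True:
        # Hand out every complete line already buffered before reading more.
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line
        chunk = sock.recv(bufsize)
        if not chunk:  # peer closed the connection
            break
        pending += chunk
    if pending:  # final record without a trailing newline
        yield pending
```

A consumer would simply loop with "for line in iter_lines(sock)" and handle each record before the next read happens, which is the behaviour described above.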

With TCP, the sending side will only send so much data before getting
an ACK, an acknowledgment that the packets sent were actually received.
When an ACK isn't received, the sender stops sending, and the TCP calls
will block at the sender (or return an error if the socket is in
non-blocking mode) until it gets a response or the socket times out.
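
The sender-side behaviour can be observed with a small experiment. This is a rough Python sketch over a loopback TCP connection, not anything from the replicator. The helper names tcp_pair, fill_until_blocked, and drain are made up for illustration, and the amount of data accepted before the sender stalls depends on the operating system's buffer sizes:

```python
import select
import socket


def tcp_pair():
    """Return a connected (sender, receiver) pair of loopback TCP sockets."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    sender = socket.create_connection(srv.getsockname())
    receiver, _ = srv.accept()
    srv.close()
    return sender, receiver


def fill_until_blocked(sender, chunk=b"x" * 65536):
    """Send without anyone reading until the kernel refuses more data.

    The socket is non-blocking, so a stalled send shows up as
    BlockingIOError instead of a call that hangs.
    """
    sender.setblocking(False)
    total = 0
    while True:
        try:
            total += sender.send(chunk)
        except BlockingIOError:
            return total  # buffers are full; the sender has to wait


def drain(receiver, nbytes):
    """Read up to nbytes from receiver and return how many bytes arrived."""
    remaining = nbytes
    while remaining > 0:
        data = receiver.recv(min(65536, remaining))
        if not data:
            break
        remaining -= len(data)
    return nbytes - remaining
```

Once the reader drains what was sent, select.select([], [sender], [], timeout) should report the sender as writable again. The only data buffered in the meantime is whatever fits in the kernel's TCP buffers.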

So if you have a non-buffering reader and a blocking sender, then you
can stream the data and only relatively small amounts of data are
buffered at any time. The problem is that the reader in the HTTP client
isn't waiting for the data to be demanded at all; instead, as soon as
data comes in, it sends it to a receiving Erlang process. An Erlang
process's mailbox is unbounded and sending to it never blocks, so there
is no limit to the amount of data buffered. So if the Erlang process
can't process the data fast enough, it starts getting buffered in its
mailbox, consuming unlimited memory.

Assuming I understand the problem correctly, the way to fix it is to
have the HTTP client not read the data until it's demanded by the
consuming process. Then we are only using the default TCP buffers, not
the Erlang message queues, as a buffer, and the total amount of memory
used at any time is small.


Dunno about HTTP clients, but when I was playing around with gen_tcp a
week or two ago I found a parameter to opening a socket that is
something like {active, false} that affects this specific
functionality. Active sockets send TCP data as Erlang messages;
passive ({active, false}) sockets don't, and you have to get the data
with gen_tcp:recv(Sock, Length).

I haven't the foggiest if the HTTP bits expose any of that though.

As far as I can tell, the {stream,{self,once}} translates to an
inet:setopts(socket(), [{active,once}]), which accomplishes the same
basic goal as {active,false}, just with repeated calls to
setopts(Sock,[{active,once}]) instead of gen_tcp:recv(Sock, Length). I
must be missing something, though, because clearly I'm getting more
messages than I asked for.

I'm sure I could cook up something simple using gen_tcp directly, but
even then I'd have to deal with authentication, SSL, etc., so I'd prefer
to use a full-fledged HTTP client if I can get it to work.  Best,


Oscar from Erlang Training and Consulting has just open-sourced one of
their HTTP clients, which may be a better fit than ibrowse as it seems
to be a much thinner layer. There is some discussion of it on the
Erlang Questions list, but the most useful link is probably to the
source code:

http://bitbucket.org/etc/lhttpc

This does not support streaming the response body yet, but Oscar's
told me that it shouldn't be hard to add. So this may be just the
thing for getting a raw connection to the socket, without having to
worry about auth, ssl, etc.

Hi Chris, I saw Oscar's announcement on erlang-questions and checked out the code. It definitely won't work for us without the ability to read a chunked response, but if he adds that and the streaming option I'm sure it'll be worth a closer look.

In my opinion it's unfortunate that we have a proliferation of HTTP clients instead of one really solid implementation. I'm sure ETC had its reasons for starting from scratch instead of contributing to ibrowse or inets. Oscar certainly did a great job of describing the limitations of both in these two messages:

http://groups.google.com/group/erlang-programming/browse_thread/thread/bc5db72fbe2ac9c7
http://groups.google.com/group/erlang-programming/browse_thread/thread/a896b641348a50ca

Cheers, Adam
