On Jun 12, 2009, at 10:59 AM, Paul Davis wrote:
On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz <[email protected]> wrote:
On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:
Hi Damien, I'm not sure I follow. My worry was that, if I built a replicator which only queried _changes to get the list of updates, I'd have to be prepared to process a very large response. I thought one smart way to process this response was to throttle the download at the TCP level by putting the socket into passive mode.
You will have a very large response, but you can stream it, processing one line at a time; then you discard the line and process the next. As long as the writer is using a blocking socket and the reader is only reading as much data as necessary to process a line, you never need to store much of the data in memory on either side. But it seems the HTTP client is buffering the data as it comes in, perhaps unintentionally.
With TCP, the sending side will only send so much data before getting an ACK, an acknowledgment that the packets sent were actually received. When an ACK isn't received, the sender stops sending, and the TCP calls block at the sender (or return an error if the socket is in non-blocking mode) until it gets a response or the socket times out.
So if you have a non-buffering reader and a blocking sender, then you can stream the data and only relatively small amounts of data are buffered at any time. The problem is that the reader in the HTTP client isn't waiting for the data to be demanded at all; instead, as soon as data comes in, it sends it to a receiving Erlang process. Erlang processes never block to receive messages, so there is no limit to the amount of data buffered. So if the Erlang process can't process the data fast enough, it starts getting buffered in its mailbox, consuming unlimited memory.
Assuming I understand the problem correctly, the way to fix it is to have the HTTP client not read the data until it's demanded by the consuming process. Then we are only using the default TCP buffers, not the Erlang message queues, as a buffer, and the total amount of memory used at any time is small.
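The non-buffering reader Damien describes can be sketched directly with gen_tcp. This is a hypothetical, simplified example (module and function names are made up, HTTP parsing is ignored): a passive socket with {packet, line} lets us pull exactly one line per recv call, so a slow reader throttles the sender through TCP's own window.

```erlang
%% Sketch only: stream a line-oriented response over a passive socket
%% so the sender blocks when the reader falls behind.
-module(line_stream).
-export([stream/2]).

stream(Host, Port) ->
    %% {active, false}: no data arrives as messages; we must ask for it.
    %% {packet, line}: recv/2 returns exactly one newline-terminated line.
    {ok, Sock} = gen_tcp:connect(Host, Port,
                                 [binary, {active, false}, {packet, line}]),
    ok = gen_tcp:send(Sock, <<"GET /db/_changes HTTP/1.0\r\n\r\n">>),
    loop(Sock).

loop(Sock) ->
    case gen_tcp:recv(Sock, 0) of
        {ok, Line} ->
            %% Process one line, then ask for the next.  While we're busy
            %% here, unacknowledged data backs up in the kernel buffers
            %% and TCP flow control stalls the sender.
            handle_line(Line),
            loop(Sock);
        {error, closed} ->
            gen_tcp:close(Sock),
            ok
    end.

handle_line(Line) ->
    io:format("got ~p bytes~n", [byte_size(Line)]).
```

Nothing Erlang-side buffers more than one line at a time; the only buffering is the kernel's TCP window on each end.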
Dunno about HTTP clients, but when I was playing around with gen_tcp a week or two ago I found a socket option, {active, false}, that affects this specific functionality. Active sockets deliver TCP data as Erlang messages; passive sockets don't, and you have to get the data with gen_tcp:recv(Sock). I haven't the foggiest if the HTTP bits expose any of that, though.
As far as I can tell, {stream, {self, once}} translates to inet:setopts(Socket, [{active, once}]), which accomplishes the same basic goal as {active, false}, just with repeated calls to setopts(Sock, [{active, once}]) instead of gen_tcp:recv(Sock). I must be missing something, though, because clearly I'm getting more messages than I asked for.
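For comparison, the {active, once} pattern Adam is describing looks roughly like this (a sketch, assuming Sock is an already-connected socket opened with {active, once}; handle/1 is a hypothetical callback):

```erlang
%% At most one {tcp, Sock, Data} message is delivered per setopts call,
%% so the process mailbox can't grow without bound.
recv_once(Sock) ->
    receive
        {tcp, Sock, Data} ->
            handle(Data),
            %% Re-arm the socket for exactly one more message.  Until we
            %% do this, no further data is delivered, and TCP flow
            %% control eventually stalls the sender.
            ok = inet:setopts(Sock, [{active, once}]),
            recv_once(Sock);
        {tcp_closed, Sock} ->
            ok
    end.
```

If an HTTP client using this pattern still floods the consumer, the likely explanation is that the client process itself re-arms the socket eagerly and forwards the data on, rather than waiting for the consumer to demand it.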
I'm sure I could cook up something simple using gen_tcp directly, but even then I'll have to deal with authentication, SSL, etc., so I'd prefer to use a full-fledged HTTP client if I can get it to work.

Best, Adam