On Fri, Jun 12, 2009 at 8:06 AM, Adam Kocoloski <[email protected]> wrote:
> On Jun 12, 2009, at 10:59 AM, Paul Davis wrote:
>
>> On Fri, Jun 12, 2009 at 10:47 AM, Damien Katz <[email protected]> wrote:
>>>
>>> On Jun 12, 2009, at 8:59 AM, Adam Kocoloski wrote:
>>>
>>>> Hi Damien, I'm not sure I follow. My worry was that, if I built a
>>>> replicator which only queried _changes to get the list of updates,
>>>> I'd have to be prepared to process a very large response. I thought
>>>> one smart way to process this response was to throttle the download
>>>> at the TCP level by putting the socket into passive mode.
>>>
>>> You will have a very large response, but you can stream it, processing
>>> one line at a time, then you discard the line and process the next. As
>>> long as the writer is using a blocking socket and the reader is only
>>> reading as much data as necessary to process a line, you never need to
>>> store much of the data in memory on either side. But it seems the HTTP
>>> client is buffering the data as it comes in, perhaps unintentionally.
>>>
>>> With TCP, the sending side will only send so much data before getting
>>> an ACK, acknowledgment that packets sent were actually received. When
>>> an ACK isn't received, the sender stops sending, and the TCP calls will
>>> block at the sender (or return an error if the socket is in
>>> non-blocking mode), until it gets a response or socket timeout.
>>>
>>> So if you have a non-buffering reader and a blocking sender, then you
>>> can stream the data and only relatively small amounts of data are
>>> buffered at any time. The problem is the reader in the HTTP client
>>> isn't waiting for the data to be demanded at all, instead as soon as
>>> data comes in, it sends it to a receiving Erlang process. Erlang
>>> processes never block to receive messages, so there is no limit to the
>>> amount of data buffered. So if the Erlang process can't process the
>>> data fast enough, it starts getting buffered in its mailbox, consuming
>>> unlimited memory.
>>>
>>> Assuming I understand the problem correctly, the way to fix it is to
>>> have the HTTP client not read the data until it's demanded by the
>>> consuming process. Then we are only using the default TCP buffers, not
>>> the Erlang message queues as a buffer, and the total amount of memory
>>> used at any time is small.
>>>
>>
>> Dunno about HTTP clients, but when I was playing around with gen_tcp a
>> week or two ago I found a parameter to opening a socket that is
>> something like {active, false} that affects this specific
>> functionality. Active sockets send TCP data as Erlang messages,
>> inactive sockets don't and you have to get the data with
>> gen_tcp:recv(Sock).
>>
>> I haven't the foggiest if the HTTP bits expose any of that though.
>
> As far as I can tell, the {stream,{self,once}} translates to an
> inet:setopts(socket(), [{active,once}]), which accomplishes the same
> basic goal as {active,false}, just with repeated calls to
> setopts(Sock,[{active,once}]) instead of gen_tcp:recv(Sock). I must be
> missing something, though, because clearly I'm getting more messages
> than I asked for.
>
> I'm sure I could cook up something simple using gen_tcp directly, but
> even then I'll have to deal with authentication, ssl, etc. so I'd prefer
> to use a full-fledged HTTP client if I can get it to work.
>
> Best,
>
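
For anyone following along, the pull-based reading Damien describes can be sketched roughly as below. This is only an illustration in Python, not Erlang, and not how ibrowse or any other HTTP client works: send_lines, read_lines and run_demo are made-up names, and socket.socketpair() stands in for a real HTTP connection. The reader calls recv() only when the consumer asks for another line, which is the same idea as a passive ({active, false}) socket read with gen_tcp:recv, and the writer's blocking sendall() stalls whenever the kernel buffers fill up.

```python
import socket
import threading


def send_lines(sock, n_lines, payload_size):
    """Blocking writer: sendall() stalls while the kernel buffers are full."""
    line = b"x" * payload_size + b"\n"
    try:
        for _ in range(n_lines):
            sock.sendall(line)
    finally:
        sock.close()  # closing signals end-of-stream to the reader


def read_lines(sock, stats, chunk_size=4096):
    """Pull-based reader: recv() is called only when another line is demanded."""
    buf = b""
    while True:
        newline = buf.find(b"\n")
        if newline >= 0:
            line, buf = buf[:newline], buf[newline + 1:]
            yield line  # nothing more is read until the consumer resumes us
            continue
        chunk = sock.recv(chunk_size)
        if not chunk:  # peer closed the connection
            if buf:
                yield buf
            return
        buf += chunk
        stats["peak"] = max(stats["peak"], len(buf))


def run_demo(n_lines=2000, payload_size=1000):
    """Stream n_lines through a socket pair; return (count, bytes, peak buffer)."""
    reader_sock, writer_sock = socket.socketpair()
    writer = threading.Thread(
        target=send_lines, args=(writer_sock, n_lines, payload_size)
    )
    writer.start()
    stats = {"peak": 0}
    count = 0
    total = 0
    for line in read_lines(reader_sock, stats):
        count += 1          # "process" the line, then discard it
        total += len(line)
    writer.join()
    reader_sock.close()
    return count, total, stats["peak"]


if __name__ == "__main__":
    print(run_demo())
```

However much data the writer pushes, the reader never holds more than one chunk plus one partial line, because nothing is read from the socket until the previous line has been handed off. Anything beyond that sits in the kernel's TCP buffers or is held back at the sender.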
Oscar from Erlang Training and Consulting has just open-sourced one of
their HTTP clients, which may be a better fit than ibrowse as it seems
to be a much thinner layer. There is some discussion of it on the Erlang
Questions list, but the most useful link is probably to the source code:

http://bitbucket.org/etc/lhttpc

This does not support streaming the response body yet, but Oscar's told
me that it shouldn't be hard to add. So this may be just the thing for
getting a raw connection to the socket, without having to worry about
auth, ssl, etc.

Chris

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
