Hi,

PetitParser2 [1] supports parsing of streams. I have been experimenting with ZnClient and came up with the following solution:

1) Create a PP2 stream from the ZnClient stream:

    byteStream := ZnClient new
        url: 'http://pharo.org';
        streaming: true;
        get.
    stream := PP2CharacterStream on: byteStream encoder: ZnUTF8Encoder new.

2) Create a parser for the header:

    head := '<head>' asPParser, #any asPParser starLazy, '</head>' asPParser.

3) Create a parser that reads everything up to the header or the body (in case the header is not present), then parses the header:

    headStart := '<head' asPParser.
    bodyStart := '<body' asPParser.
    parser := (#any asPParser starLazy: (headStart / bodyStart)), head ==> #second.
    result := parser optimize parse: stream.

4) Finally, the content of the header is a collection of characters. I don't know the best way to convert it into a string; perhaps this:

    text := (result second
        inject: (WriteStream on: String new)
        into: [ :out :char | out nextPut: char. out ]) contents

Cheers,
Jan

[1]: https://github.com/kursjan/petitparser2

On Sun, Nov 27, 2016 at 1:38 PM PBKResearch <pe...@pbkresearch.co.uk> wrote:

> Paul
>
> Not sure if this is helpful - I have not tried it out, but it may give
> you a pointer.
>
> As Sven says, you need to parse a stream and be able to stop when you
> reach the desired point. If instead of Soup you use XMLHTMLParser, this
> has streaming siblings called SAXHTMLHandler and SAX2HTMLParser. I think
> it should be possible to use one or the other to stop when you reach the
> </head> tag.
>
> Personally I find the output of XMLHTMLParser easier to follow than that
> of Soup, but this may be a matter of taste.
>
> Hope this helps
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of Sven Van Caekenberghe
> Sent: 26 November 2016 18:19
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnClient GET, but just the content of the <head> tag?
>
> Paul,
>
> > On 26 Nov 2016, at 18:31, PAUL DEBRUICKER <pdebr...@gmail.com> wrote:
> >
> > This is a micro optimization if there ever was one but I wondered if
> > it was possible to stop downloading and get the entity once the
> > </head> tag has been received.
> >
> > Right now I download the whole page, parse it with Soup, then extract
> > the tags I want from the head. Which works fine. e.g.
> >
> > head:=((Soup fromString: (ZnEasy get: 'http://pharo.org') entity)
> >     findChildTag: 'html') findChildTag: 'head'.
>
> This would only be useful for large pages. Dealing with the content of
> resources (like parsing HTML) is outside the scope of Zinc. However, I
> can help you get started.
>
> What you want to do is use streaming. That gives you access to the
> content of a resource using a direct stream, so you could decide to stop
> reading (but then you have to close the connection, else you need to
> read everything anyway).
>
> Start by having a look at ZnClient>>#downloadTo: and ZnStreamingEntity.
> What you want to do is more or less the following.
>
> ZnClient new
>     url: 'http://pharo.org';
>     streaming: true;
>     get.
>
> At this point, the request is done, the response is in, but the entity
> of the response is not yet read. When you ask for the entity, you get a
> ZnStreamingEntity which holds the stream that you then have to read
> from. You can check the response (and its header) for meta info.
>
> Your next challenge then is to process this stream so that you can parse
> it in a real streaming fashion. I don't know if Soup can do this.
>
> Sven