Re: [Pharo-users] ZnClient GET, but just the content of the <head> tag?

Jan Kurš Sun, 27 Nov 2016 10:48:06 -0800

Hi,

PetitParser2 [1] supports parsing of streams. I have been experimenting
with ZnClient and came up with the following solution:


1) Create a PP2 stream from the ZnClient stream:
byteStream := ZnClient new
  url: 'http://pharo.org';
  streaming: true;
  get.
stream := PP2CharacterStream on: byteStream encoder: ZnUTF8Encoder new.

2) Create a parser for the <head> element:
head := '<head>' asPParser, #any asPParser starLazy, '</head>' asPParser.

3) Create a parser that reads everything up to the start of <head> or <body>
(in case <head> is not present) and then parses the <head> element:
headStart := '<head' asPParser.
bodyStart := '<body' asPParser.
parser := (#any asPParser starLazy: (headStart / bodyStart)), head ==>
#second.

result := parser optimize parse: stream.
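
As Sven notes in the quoted message below, stopping early only helps if the
connection is then closed. A minimal sketch of that, assuming the parser from
step 3 is already defined; the only additions to the code above are the
client variable, ensure: and ZnClient>>#close:

```smalltalk
"Sketch only: keep the client in a variable so it can be closed after parsing.
 'parser' is the parser built in step 3."
client := ZnClient new.
[ byteStream := client
      url: 'http://pharo.org';
      streaming: true;
      get.
  stream := PP2CharacterStream on: byteStream encoder: ZnUTF8Encoder new.
  result := parser optimize parse: stream ]
    ensure: [ client close ].
```

The ensure: block runs whether parsing succeeds or fails, so the rest of the
page is never read.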

4) Finally, the content of the <head> element is a collection of characters.
I don't know the best way to convert it into a string; perhaps this:
text := (result second inject: (WriteStream on: String new) into: [ :out :char |
out nextPut: char. out ]) contents
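
A possibly simpler alternative for this last step, as a sketch. It assumes
that result second really is a collection of Character objects, as described
above; the variant with withAll: additionally assumes the collection is
sequenceable:

```smalltalk
"Variant 1: collect the characters through a write stream on a new String."
text := String streamContents: [ :out |
    result second do: [ :char | out nextPut: char ] ].

"Variant 2: copy the characters directly
 (assumes a sequenceable collection of Characters)."
text := String withAll: result second.
```

Both avoid writing into a literal string and avoid reusing the name of the
outer stream variable as a block argument.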

Cheers,
Jan

[1]: https://github.com/kursjan/petitparser2

On Sun, Nov 27, 2016 at 1:38 PM PBKResearch <pe...@pbkresearch.co.uk> wrote:

> Paul
>
> Not sure if this is helpful - I have not tried it out, but it may give you
> a pointer.
>
> As Sven says, you need to parse a stream and be able to stop when you reach
> the desired point. If instead of Soup you use XMLHTMLParser, this has
> streaming siblings called SAXHTMLHandler and SAX2HTMLParser. I think it
> should be possible to use one or the other to stop when you reach the
> </head> tag.
>
> Personally I find the output of XMLHTMLParser easier to follow than that of
> Soup, but this may be a matter of taste.
>
> Hope this helps
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of
> Sven Van Caekenberghe
> Sent: 26 November 2016 18:19
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Subject: Re: [Pharo-users] ZnClient GET, but just the content of the <head> tag?
>
> Paul,
>
> > On 26 Nov 2016, at 18:31, PAUL DEBRUICKER <pdebr...@gmail.com> wrote:
> >
> > This is a micro optimization if there ever was one but I wondered if it
> > was possible to stop downloading and get the entity once the </head> tag
> > has been received.
> >
> > Right now I download the whole page, parse it with Soup, then extract the
> > tags I want from the head.  Which works fine.  e.g.
> >
> > head:=((Soup fromString: (ZnEasy get: 'http://pharo.org') entity)
> >                               findChildTag: 'html') findChildTag: 'head'.
>
> This would only be useful for large pages. Dealing with the content of
> resources (like parsing HTML) is outside the scope of Zinc. However, I can
> help you get started.
>
> What you want to do is use streaming. That gives you access to the content
> of a resource using a direct stream, so you could decide to stop reading
> (but then you have to close the connection, else you need to read
> everything anyway).
>
> Start by having a look at ZnClient>>#downloadTo: and ZnStreamingEntity.
> What you want to do is more or less the following.
>
> ZnClient new
>   url: 'http://pharo.org';
>   streaming: true;
>   get.
>
> At this point, the request is done, the response is in, but the entity of
> the response is not yet read. When you ask for the entity, you get a
> ZnStreamingEntity which holds the stream that you then have to read from.
> You can check the response (and its header) for meta info.
>
> Your next challenge then is to process this stream so that you can parse it
> in a real streaming fashion. I don't know if Soup can do this.
>
> Sven
>
>
>
>
>
