For example, this:
    ((StAXHTMLParser onURL: aURLString)
        nextElementNamed: 'head')
            ifNotNil: [:headElement | ...]
parses the document up to the next "head" element and returns it and any
descendants as a DOM subtree. If there's no next "head" element, it exhausts
the event stream looking for one. If you don't want that, test it first:
    (parser peek isStartTagNamed: 'head')
        ifTrue: [| headElement |
            headElement := parser nextNode.
            ...].
Because you now know what kind of DOM subtree the next events represent,
you can use #nextNode, which builds any DOM subtree out of the next events,
including an element with descendants, a string or comment node, or even an
entire document (if sent before reading the start-of-document event). So this:
    (StAXHTMLParser onURL: aURLString) nextNode
is equivalent to this:
    XMLHTMLParser parseURL: aURLString.
StAX is more useful with XML than HTML, because XML documents can be huge.
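As a minimal sketch of that huge-XML case: the snippet below pulls one element subtree at a time instead of building a DOM for the whole document. It assumes StAXParser accepts the same #onURL: constructor that StAXHTMLParser does above, and the element name 'record' is hypothetical, standing in for whatever repeated element your document has.

```smalltalk
| parser recordElement recordCount |
"Assumption: StAXParser responds to #onURL: like StAXHTMLParser does."
parser := StAXParser onURL: aURLString.
recordCount := 0.
"#nextElementNamed: answers nil once no further 'record' element exists,
 so the loop ends when the event stream is exhausted."
[(recordElement := parser nextElementNamed: 'record') notNil]
	whileTrue: [
		"recordElement is a DOM subtree; query or inspect it here."
		recordCount := recordCount + 1].
recordCount
```

Each subtree can be discarded once you are done with it, so memory use should depend on the size of a single 'record' rather than on the size of the whole document.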
> Sent: Tuesday, May 16, 2017 at 6:39 PM
> From: PBKResearch <[email protected]>
> To: "'Any question about pharo is welcome'" <[email protected]>
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc]
> ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
>
> Monty
>
> Many thanks for your help. I have followed your advice to start again in a
> clean Moose 6.1 image, and so far everything is working fine. Apologies for
> getting you to sort out the results of my stupidity. In Pharo I am really an
> experienced beginner.
>
> Thanks again
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users [mailto:[email protected]] On Behalf Of
> monty
> Sent: 16 May 2017 03:37
> To: [email protected]
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc]
> ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
>
> Something went wrong during your upgrade with class initialization.
>
> Installing the latest versions of these projects into a clean image would
> work, and so would installing the latest XMLParserHTML and XMLParserStAX into
> the newest Moose-6.1 image (which has the latest XMLParser and XPath).
>
> But if you insist on upgrading your old image, try the latest
> ConfigurationOfXMLParser (.303.mcz) and ConfigurationOfXPath (.149.mcz) from
> their PharoExtras repos and install their latest project versions, and do the
> same with XMLParserHTML and XMLParserStAX (the older versions aren't
> compatible with newer XMLParser versions). Then open the test runner and run
> all "XML|XPath" tests. If you get any failures, evaluate this:
>
> #('XML-Parser' 'XPath-Core') do: [:package |
> (SystemNavigation default allClassesInPackageNamed: package) do:
> [:class |
> class initialize]]
>
> and try running the tests again.
>
> > Sent: Monday, May 15, 2017 at 6:50 PM
> > From: PBKResearch <[email protected]>
> > To: "'Any question about pharo is welcome'" <[email protected]>
> > Subject: [Pharo-users] Problems loading XML System ( was Re: [Zinc]
> > ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
> >
> > Monty
> >
> > As an update, I have rebuilt from the Moose 6.0 download. The version of
> > XML-Parser in that was dated 18 July 2016 (configuration monty.233), so I
> > installed versions of XML-Parser-HTML and XML-Parser-StAX contemporary with
> > that. (The respective configurations are monty.48 and monty.39). With these
> > versions all my previous XMLHTMLParser operations work as before, and I
> > have been able to use the StAX parser in a simple way. So I can start
> > exploring as I intended.
> >
> > I have made repeated attempts to update this rebuilt image to more recent
> > versions of the HTML and StAX parsers, and every time I run into the same
> > error reported below. I started from the latest version and worked
> > backwards, but gave up quickly; it takes about 6 minutes on my machine to
> > load and compile a version, and it soon gets tedious. If I feel more
> > enthusiastic tomorrow, I might start working forwards from my current
> > versions.
> >
> > Anyway, I now have a working system with the StAX and HTML parsers, so I
> > can continue to explore.
> >
> > Best wishes
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:[email protected]] On Behalf Of
> > PBKResearch
> > Sent: 15 May 2017 20:44
> > To: 'Any question about pharo is welcome' <[email protected]>
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for
> > utf-8 encoding
> >
> > Monty
> >
> > I have just started trying to use the StAX parsers, and I have found that
> > the update has introduced a problem, which means that XMLHTMLParser no
> > longer works on examples I have used before. I updated to
> > ConfigurationOfXMLParser(monty.302), which is the latest version on the
> > smalltalkhub repository, and then used the load version in the class
> > comment, which loads the stable default. Similarly, I loaded
> > ConfigurationOfXMLParserHTML(monty.62) and
> > ConfigurationOfXMLParserStAX(monty.51), again using stable and default.
> > When I try to run the XMLHTMLParser example I quoted below, I get an error
> > message 'MessageNotUnderstood: receiver of "critical:" is nil'. The same
> > message comes up with anything else I try with XMLHTMLParser or with
> > StAXHTMLParser.
> >
> > I am not really up to using the debugger on someone else's code, but the
> > one thing I can see is that the problem lies in
> > XMLKeyValueCache>>critical:, which has the code:
> > ^ self mutex critical: aBlock
> > The problem being that mutex is nil.
> >
> > In my enthusiasm, I saved the updated image with the same name as the old
> > image, which is now therefore overwritten. If I cannot solve this problem,
> > my only way out is to rebuild my image from the Moose 6.0 download. Any
> > suggestions gratefully received.
> >
> > Thanks in advance
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:[email protected]] On Behalf Of
> > PBKResearch
> > Sent: 15 May 2017 19:16
> > To: 'Any question about pharo is welcome' <[email protected]>
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for
> > utf-8 encoding
> >
> > Monty
> >
> > Many thanks for this. My original purpose was just to answer Paul
> > deBruicker's query, namely to parse an html file and stop reading at the
> > end of the <head> section. I solved this by trial and error using the code
> > shown below (which actually stops at the opening tag of the body). This
> > was not my problem at all, but Paul's; I just tackled it for fun.
> >
> > However, your note has prompted me to update my version of the whole XML
> > system - I was using the version I downloaded with Moose 6.0, which was
> > dated August 2016. I am looking at the StAX parsers as a possible way of
> > simplifying what I currently do, which involves downloading an entire web
> > page as a DOM and then manipulating it with XPath to extract the bits I am
> > interested in. I may be able to use StAX to do some of the selection and
> > manipulation as I am reading.
> >
> > It's all a new topic to me, so I foresee a lot of experimentation. It all
> > helps to keep the grey matter active.
> >
> > Thanks again
> >
> > Peter Kenny
> >
> > -----Original Message-----
> > From: Pharo-users [mailto:[email protected]] On Behalf Of
> > monty
> > Sent: 15 May 2017 12:15
> > To: [email protected]
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for
> > utf-8 encoding
> >
> > For that kind of incremental parsing, you could also use XMLParserStAX, a
> > pull-parser that parses a document as a stream of event objects you control
> > with #next, #peek, and #atEnd. It also supports pull-DOM parsing with
> > messages like #nextNode, #nextElement, and #nextElementNamed:, which return
> > the next event object(s) as DOM subtrees (searchable with XPath). See the
> > StAXParser class comment for an example. (The StAXHTMLParser class requires
> > XMLParserHTML be installed to work.)