Re: [Pharo-users] Problems loading XML System ( was [Zinc] ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)

monty Wed, 17 May 2017 14:11:10 -0700

For example, this:
((StAXHTMLParser onURL: aURLString)
        nextElementNamed: 'head')
                ifNotNil: [:headElement | ...]


parses the document up to the next "head" element and returns that element and 
its descendants as a DOM subtree. If there's no next "head" element, it exhausts 
the event stream looking for one. If you don't want that, test it first:
(parser peek isStartTagNamed: 'head')
        ifTrue: [| headElement |
                headElement := parser nextNode.
                ...].

Because you now know what kind of DOM subtree the next events represent, you 
can use #nextNode, which builds a DOM subtree out of the next events, whatever 
they represent: an element with descendants, a string or comment node, or even 
an entire document (if sent before reading the start-of-document event). So this:
(StAXHTMLParser onURL: aURLString) nextNode

is equivalent to this:
XMLHTMLParser parseURL: aURLString.

StAX is more useful with XML than HTML, because XML documents can be huge and 
pull parsing lets you avoid building the whole DOM in memory.
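
As a rough sketch of driving the event stream by hand, using only the 
messages shown above (aURLString is the same placeholder as before, and 
stopping at a "body" start tag is just an assumption for illustration), you 
could skip events until the parser reaches a "head" or "body" start tag, so 
a document with no "head" doesn't get read to the end:

```smalltalk
| parser headElement |
parser := StAXHTMLParser onURL: aURLString.

"Skip events until the next one is a 'head' or 'body' start tag,
 or the stream runs out. #peek is only sent after checking #atEnd."
[parser atEnd
        or: [(parser peek isStartTagNamed: 'head')
                or: [parser peek isStartTagNamed: 'body']]]
        whileFalse: [parser next].

"Build the DOM subtree only if we stopped on a 'head' start tag;
 otherwise headElement stays nil."
(parser atEnd not
        and: [parser peek isStartTagNamed: 'head'])
        ifTrue: [headElement := parser nextNode].
```

I haven't run this exact snippet, so treat it as a starting point rather 
than tested code.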

> Sent: Tuesday, May 16, 2017 at 6:39 PM
> From: PBKResearch <[email protected]>
> To: "'Any question about pharo is welcome'" <[email protected]>
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] 
> ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
>
> Monty
> 
> Many thanks for your help. I have followed your advice to start again in a 
> clean Moose 6.1 image, and so far everything is working fine. Apologies for 
> getting you to sort out the results of my stupidity. In Pharo I am really an 
> experienced beginner.
> 
> Thanks again
> 
> Peter Kenny
> 
> -----Original Message-----
> From: Pharo-users [mailto:[email protected]] On Behalf Of 
> monty
> Sent: 16 May 2017 03:37
> To: [email protected]
> Subject: Re: [Pharo-users] Problems loading XML System ( was [Zinc] 
> ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
> 
> Something went wrong during your upgrade with class initialization.
> 
> Installing the latest versions of these projects into a clean image would 
> work, and so would installing the latest XMLParserHTML and XMLParserStAX into 
> the newest Moose-6.1 image (which has the latest XMLParser and XPath).
> 
> But if you insist on upgrading your old image, try the latest 
> ConfigurationOfXMLParser (.303.mcz) and ConfigurationOfXPath (.149.mcz) from 
> their PharoExtras repos and install their latest project versions, and do the 
> same with XMLParserHTML and XMLParserStAX (the older versions aren't 
> compatible with newer XMLParser versions). Then open the test runner and run 
> all "XML|XPath" tests. If you get any failures, evaluate this:
> 
> #('XML-Parser' 'XPath-Core') do: [:package |
>       (SystemNavigation default allClassesInPackageNamed: package)
>               do: [:class | class initialize]]
> 
> and try running the tests again.
> 
> > Sent: Monday, May 15, 2017 at 6:50 PM
> > From: PBKResearch <[email protected]>
> > To: "'Any question about pharo is welcome'" <[email protected]>
> > Subject: [Pharo-users] Problems loading XML System ( was Re: [Zinc] 
> > ZnInvalidUTF8: Illegal leading byte for utf-8 encoding)
> >
> > Monty
> > 
> > As an update, I have rebuilt from the Moose 6.0 download. The version of 
> > XML-Parser in that was dated 18 July 2016 (configuration monty.233), so I 
> > installed versions of XML-Parser-HTML and XML-Parser-StAX contemporary with 
> > that. (The respective configurations are monty.48 and monty.39). With these 
> > versions all my previous XMLHTMLParser operations work as before, and I 
> > have been able to use the StAX parser in a simple way. So I can start 
> > exploring as I intended.
> > 
> > I have made repeated attempts to update this rebuilt image to more recent 
> > versions of the HTML and StAX parsers, and every time I run into the same 
> > error reported below. I started from the latest version and worked 
> > backwards, but gave up quickly; it takes about 6 minutes on my machine to 
> > load and compile a version, and it soon gets tedious. If I feel more 
> > enthusiastic tomorrow, I might start working forwards from my current 
> > versions.
> > 
> > Anyway, I now have a working system with the StAX and HTML parsers, so I 
> > can continue to explore.
> > 
> > Best wishes
> > 
> > Peter Kenny
> > 
> > -----Original Message-----
> > From: Pharo-users [mailto:[email protected]] On Behalf Of 
> > PBKResearch
> > Sent: 15 May 2017 20:44
> > To: 'Any question about pharo is welcome' <[email protected]>
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for 
> > utf-8 encoding
> > 
> > Monty
> > 
> > I have just started trying to use the StAX parsers, and I have found that 
> > the update has introduced a problem, which means that XMLHTMLParser no 
> > longer works on examples I have used before. I updated to 
> > ConfigurationOfXMLParser(monty.302), which is the latest version on the 
> > smalltalkhub repository, and then used the load version in the class 
> > comment, which loads the stable default. Similarly, I loaded 
> > ConfigurationOfXMLParserHTML(monty.62) and 
> > ConfigurationOfXMLParserStAX(monty.51), again using stable and default. 
> > When I try to run the XMLHTMLParser example I quoted below, I get an error 
> > message 'MessageNotUnderstood: receiver of "critical:" is nil'. The same 
> > message comes up with anything else I try with XMLHTMLParser or with 
> > StAXHTMLParser.
> > 
> > I am not really up to using the debugger on someone else's code, but the 
> > one thing I can see is that the problem lies in 
> > XMLKeyValueCache>>critical:, which has the code:
> > ^ self mutex critical: aBlock
> > The problem being that mutex is nil. 
> > 
> > In my enthusiasm, I saved the updated image with the same name as the old 
> > image, which is now therefore overwritten. If I cannot solve this problem, 
> > my only way out is to rebuild my image from the Moose 6.0 download. Any 
> > suggestions gratefully received.
> > 
> > Thanks in advance
> > 
> > Peter Kenny
> > 
> > -----Original Message-----
> > From: Pharo-users [mailto:[email protected]] On Behalf Of 
> > PBKResearch
> > Sent: 15 May 2017 19:16
> > To: 'Any question about pharo is welcome' <[email protected]>
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for 
> > utf-8 encoding
> > 
> > Monty
> > 
> > Many thanks for this. My original purpose was just to answer Paul 
> > deBruicker's query, namely to parse an HTML file and stop reading at the 
> > end of the <head> section. I solved this by trial and error using the code 
> > shown below (which actually stops at the opening tag of the body). This 
> > was not my problem at all, but Paul's; I just tackled it for fun.
> > 
> > However, your note has prompted me to update my version of the whole XML 
> > system - I was using the version I downloaded with Moose 6.0, which was 
> > dated August 2016. I am looking at the StAX parsers as a possible way of 
> > simplifying what I currently do, which involves downloading an entire web 
> > page as a DOM and then manipulating it with XPath to extract the bits I am 
> > interested in. I may be able to use StAX to do some of the selection and 
> > manipulation as I am reading.
> > 
> > It's all a new topic to me, so I foresee a lot of experimentation. It all 
> > helps to keep the grey matter active.
> > 
> > Thanks again
> > 
> > Peter Kenny
> > 
> > -----Original Message-----
> > From: Pharo-users [mailto:[email protected]] On Behalf Of 
> > monty
> > Sent: 15 May 2017 12:15
> > To: [email protected]
> > Subject: Re: [Pharo-users] [Zinc] ZnInvalidUTF8: Illegal leading byte for 
> > utf-8 encoding
> > 
> > For that kind of incremental parsing, you could also use XMLParserStAX, a 
> > pull-parser that parses a document as a stream of event objects you control 
> > with #next, #peek, and #atEnd. It also supports pull-DOM parsing with 
> > messages like #nextNode, #nextElement, and #nextElementNamed:, which return 
> > the next event object(s) as DOM subtrees (searchable with XPath). See the 
> > StAXParser class comment for an example. (The StAXHTMLParser class requires 
> > XMLParserHTML be installed to work.)
