Paul

Good to have found the charset discrepancy - that may have something to do with 
it. But I don't think it has to do with the C’è in the body of the page. I have 
just parsed another page published today, with the same error, and again it 
fails in parsing the <head> node, so it has not even reached the body. The 
<head> contains a meta which describes the article - a sort of paraphrase of 
the article headline - and it fails in the middle of decoding that. The 
character at which it fails is again $«, so that is definitely the cause. Maybe 
the wrong charset declaration explains why the decoder trips over it, but I 
don't know enough about the different charsets to say. Does ISO-8859-1 even 
contain $«?
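As a byte-level sanity check (sketched in Python only because the encodings themselves are language-independent), ISO-8859-1 does contain «, as the single byte 0xAB, and a lone 0xAB is exactly the kind of byte a strict UTF-8 decoder rejects:

```python
# "«" (U+00AB, left-pointing double angle quotation mark) is in ISO-8859-1.
latin1 = '«'.encode('iso-8859-1')
print(latin1)            # b'\xab' -- a single byte, 0xAB
utf8 = '«'.encode('utf-8')
print(utf8)              # b'\xc2\xab' -- two bytes in UTF-8

# A lone 0xAB is not valid UTF-8: 0xAB has the continuation-byte pattern
# (10xxxxxx) and cannot start a sequence, so a strict decoder rejects it.
try:
    latin1.decode('utf-8')
except UnicodeDecodeError as e:
    print('invalid UTF-8:', e.reason)   # invalid UTF-8: invalid start byte
```

So a page that is really ISO-8859-1 but is decoded as UTF-8 would fail on the very first «, which matches the 'Invalid UTF8 encoding' error.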

Peter

-----Original Message-----
From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of 
Paul DeBruicker
Sent: 08 October 2017 18:41
To: pharo-users@lists.pharo.org
Subject: Re: [Pharo-users] Problem with input to XML Parser - 'Invalid UTF8 
encoding'

In the HEAD tag of that page with the article, they declare ISO-8859-1 and not 
UTF-8. Yet in the page they have a  

C’è 

The little apostrophe next to the C is Unicode code point 8217, the right 
single quotation mark (http://www.codetable.net/decimal/8217).


So their encoding is messed up, and maybe the XMLHTMLParser should throw a 
warning or something if there is a mismatch.  
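The mismatch is easy to demonstrate at the byte level (Python again, purely for illustration): code point 8217 does not exist in ISO-8859-1 at all, so a page that actually contains it cannot really be ISO-8859-1, whatever the meta tag claims:

```python
# U+2019 (right single quotation mark, decimal 8217) has no ISO-8859-1 code.
apostrophe = '\u2019'
print(apostrophe.encode('utf-8'))      # b'\xe2\x80\x99' -- three bytes in UTF-8

try:
    apostrophe.encode('iso-8859-1')
except UnicodeEncodeError:
    print('not representable in ISO-8859-1')

# Decoding the UTF-8 bytes of "C’è" per the declared charset produces the
# classic mojibake (cp1252 used here; browsers commonly treat pages labelled
# ISO-8859-1 as Windows-1252):
print('C\u2019\u00e8'.encode('utf-8').decode('cp1252'))  # Câ€™Ã¨
```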


Glad you found a work around. 

In another thread (on SVG Icons) Sven referred to ways of getting input from a 
URL for XMLDOMParser. I have recently had some problems doing this. I have 
found a workaround, so it is not urgent, but I thought I should put it on 
record in case anyone else is bitten by it, and so maybe Monty can look at it.

 

I am using the subclass XMLHTMLParser, and my usual way of invoking it was:

1.      XMLHTMLParser parseURL: <urlstring>.

This works in most cases, but with one particular site - 
http://www.corriere.it/....., which is an Italian newspaper - I had frequent 
failures, with the error message 'Invalid UTF8 encoding'. The parser has the 
option of parsing a string, which is obtained by other means, so I tried 
reading it in with Zinc:

2.      XMLHTMLParser parse: <urlstring> asZnUrl retrieveContents.

And this worked, so clearly the encoding on the site is OK. I realised that the 
XML-Parser package has its own methods, which reproduce a lot of the 
functionality of Zinc, so I tried the equivalent:

3.      XMLHTMLParser parse: <urlstring> asXMLURI get.

To my surprise, this worked equally well. I had expected problems, because 
presumably forms (1) and (3) use the same UTF8 decoding.

 

For now, I am using the form (3) for all my work, and have not had any problems 
since. So the message to anyone who is using the form (1) and getting problems 
is to try (2) or (3) instead.
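The working forms (2) and (3) both boil down to the same pattern: fetch the bytes, decode them to a string yourself, and only then hand the string to the parser. A minimal offline sketch of that pattern (Python for illustration; the page content and class name below are made up, standing in for the Corriere page):

```python
from html.parser import HTMLParser

# Stand-in for the problem page: a <head> whose description contains the
# troublesome characters (content and structure here are invented).
page = ('<html><head>'
        '<meta name="description" content="C\u2019\u00e8 \u00abtest\u00bb">'
        '</head><body></body></html>')
raw = page.encode('utf-8')   # the bytes as they would come off the wire

class MetaCollector(HTMLParser):
    """Grab the content attribute of <meta name="description">."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag == 'meta' and dict(attrs).get('name') == 'description':
            self.description = dict(attrs).get('content')

# The workaround pattern: decode the bytes explicitly, then parse the string
# instead of letting the parser fetch and decode the URL itself.
collector = MetaCollector()
collector.feed(raw.decode('utf-8'))
print(collector.description)   # C’è «test»
```

Once the decoding step is under your own control, you can also fall back to another charset if strict UTF-8 decoding fails.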

 

I am using Moose 6.1 (Pharo 6.0 Latest update: #60486) on Windows 10. I think 
most articles on the Corriere web site will generate the error, but one which 
has always failed for me is:

http://www.corriere.it/esteri/17_ottobre_03/felipe-spagna-catalogna-discorso
-8f7ac0d6-a86d-11e7-a090-96160224e787.shtml

I tried to trace through the error using the debugger, but it got too 
confusing. However, I did establish that the failure occurred early in decoding 
the HTML <head>, in the line beginning <meta name="description"..
The only unusual thing at this point is the French-style open-quote «.
Whether this could explain the problem I don't know.

 

Any suggestions gratefully received.

 

Peter Kenny



