Hi Peter,

> On 09 May 2015, at 19:00, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>
> Sven
>
> Thanks for your considered response to my midnight thoughts. I now see the
> importance of distinguishing the html side (Soup in my work) from the http
> side (Zinc). I looked at your specimen byte arrays in a Pharo playground,
> which I am still learning to use; I hadn't seen that proverb in German, but I
> see the point you are making.

Haha, good!

> Now that my immediate problem is solved, I am not sure whether it is
> necessary to take up your time any more with it. It depends how many servers
> do not say what encoding they use, how many sites use encodings other than
> UTF-8 and how big is the intersection of those sets. If my problem was a
> one-off, there is no need to go any further.

It happens, but not too often (anymore).

> However, there is one general point I would like to make. The debugger is a
> very good tool for programmers investigating their own code, but it is a very
> unfriendly place for a user dumped in the middle of someone else's code;
> yesterday I learned more than I ever wished to about the innards of UTF-8
> decoding, before realising it was all irrelevant. If you do think it worth
> modifying the handling of this case, I would suggest replacing the call to
> the debugger with a dialog box of some sort, perhaps with debug as an option
> for the enthusiast, but perhaps also with an option to restart with an
> alternative encoding.

I understand your idea, but showing dialogs from system code is a no-go. All that we can do is throw better or more specific exceptions; it is up to the code invoking things (your code/application) to handle those.

> If I can, I would like to ask a supplementary question - I am deeply ignorant
> but eager to learn. As I mentioned, I tried to get round the problem by
> downloading the page source to a local file and reading from there into Soup,
> but this also involved the Zinc decoder and so failed. I tried to see how to
> get round this, using what I now know about the encoding, and came up with
> the following:
>
> binaryStream := (FileStream readOnlyFileNamed: 'display.html') binary.
> charStream := ZnCharacterReadStream on: binaryStream encoding: ZnByteEncoder
> iso88591.
> hbSoup := Soup fromString: charStream contents.

Yes, that is perfect (did you see http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/ ?)

I would write

  'display.html' asFileReference binaryReadStreamDo: [ :in |
    Soup fromString: (ZnCharacterReadStream on: in encoding: ZnByteEncoder iso88591) upToEnd ].

or even

  'display.html' asFileReference binaryReadStreamDo: [ :in |
    Soup fromString: (ZnByteEncoder iso88591 decodeBytes: in upToEnd) ].

It seems Soup does not accept streams, only strings.

> This worked, so in that sense it is OK, but I wonder if there is a neater way
> of doing it. More importantly, I found that Soup has its own decoder, so I
> can skip the second line and replace the third by:
>
> hbSoup := Soup fromString: binaryStream contents asString.
>
> At one stage I found myself looking at a debugger on this process (I know -
> this contradicts what I said above!), because I had not realised that
> 'asString' was needed. It looked as though Soup was trying three candidate
> encodings, which it labelled 'latin1', 'utf-8' and 'cp1252', to find which
> one would work. It showed the one it had 'sniffed' as most likely being
> 'latin1', which I think is the same as ISO-8859-1, so it was trying that
> first.

Yes, Latin1, cp1252 and ISO88591 are equivalent for most purposes. BTW, #asString is also more or less the same (the difference is that there is a 'hole' in the encoding).

>
> Given this, my question is whether Zinc would allow me to read from a web URL
> as a binary stream, which I could then feed into the Soup decoder in the same
> way. If I can, I would use this as my standard procedure; I expect to be
> visiting a lot of sites, and it would be handy to be able to ignore the
> encoding issue and hope that Soup can sort it out.

Downloading binary to a file goes like this:

  ZnClient new
    url: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk';
    downloadTo: 'display.html'.

If you would want the bytes in memory, you could do:

  | client bytes |
  (client := ZnClient new)
    streaming: true;
    get: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk'.
  bytes := client entity contents.
  client close.
  bytes.

> Finally, a general comment. Both this query and the one I posed earlier this
> week, answered by Vincent Blondeau, showed that Pharo users can come to this
> site and expect quick, friendly and expert help. I am retired and can devote
> all the time I want to this, but you people must have day jobs as well! I am
> really very grateful.

To get anywhere, we have to help each other.

Sven

> Best wishes
>
> Peter Kenny
>
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of
> Sven Van Caekenberghe
> Sent: 09 May 2015 07:51
> To: Any question about pharo is welcome
> Subject: Re: [Pharo-users] Problem using Zinc in Pharo 4 (Moose 5.1)
>
>
>> On 09 May 2015, at 02:18, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>
>> Sven
>>
>> Many thanks for the quick response. I always like to try to solve problems
>> myself before appealing for help, so I had worked out what was wrong, but
>> did not know how to tell Zinc to use a specific coding. I had tried by
>> reading through your very full note on Zinc, but did not find the trick you
>> describe - which works perfectly, of course.
>
> Good, yes this is a more recent thing.
>
>> It seems unfortunate that Zinc does not use the coding specified in the html
>> head. Evidently browsers like Firefox must do it, since the page displays
>> correctly. If it cannot be done, I think it would be helpful to reconsider
>> the error message produced when the user is dumped out, because in this
>> context it is misleading.
>> I spent some time tracing debugger output, trying
>> to work out what was wrong with the UTF-8, before I spotted that one of the
>> bytes was displayed in character form as $ö, and began to suspect it might
>> be a different coding; I finally confirmed this by reading the page source
>> in Firefox.
>
> Zn deals with HTTP, not with HTML, these are totally different things, a
> browser obviously does both. But even then there is no easy way to do this,
> apart from trying. Consider these two byte arrays:
>
> #[85 84 70 56 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72 195 182 108
> 108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86 111 114
> 115 195 164 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116 46]
>
> #[73 83 79 56 56 53 57 49 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72
> 246 108 108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86
> 111 114 115 228 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116
> 46]
>
> In them it says how you should decode them!
>
> The GT tools make this challenge easy because there is a tab that tries both
> encodings, but in general this is hard to solve (efficiently).
>
> But since Zn does not do HTML, it will never be added at that level.
>
> I will think about the error, it might indeed be useful to tell the user that
> a default encoding was chosen.
>
>> Thanks again for your help.
>
> You're welcome.
>
>> Peter Kenny
>>
>> -----Original Message-----
>> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of
>> Sven Van Caekenberghe
>> Sent: 08 May 2015 20:04
>> To: Any question about pharo is welcome
>> Subject: Re: [Pharo-users] Problem using Zinc in Pharo 4 (Moose 5.1)
>>
>> Peter,
>>
>> Thanks for the URL, it makes it much easier to help you.
>>
>> The answer is easy: the server is incorrect, it serves a specific encoding
>> without saying so.
>>
>> Consider:
>>
>> (ZnClient new
>>   head: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk';
>>   response) contentType.
>>
>> => 'text/html'
>>
>> If no charset/encoding is specified, the modern default is UTF-8, so Zn
>> tries that but fails.
>>
>> You can change the default for unspecified encoding as follows:
>>
>> ZnDefaultCharacterEncoder
>>   value: ZnByteEncoder iso88591
>>   during: [
>>     ZnClient new
>>       get: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk' ].
>>
>> The server should have used the following mime type to avoid the confusion:
>>
>> ZnMimeType textHtml charSet: #iso88591
>>
>> => 'text/html;charset=iso88591'
>>
>> HTH,
>>
>> Sven
>>
>> PS: the encoding inside the document cannot be used because (1) no
>> interpretation inside documents is done and (2) at that point it is too
>> late, the contents is already converted from bytes to characters
>>
>>> On 08 May 2015, at 18:51, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>>
>>> Hello
>>>
>>> I have been trying to use Soup class>> fromUrl: to access the contents of a
>>> web page. It halts with a message from Zinc about malformed UTF-8. The page
>>> displays perfectly in Firefox, so I copied the page source from there to a
>>> local file and tried to read it from there. Again a message from Zinc:
>>> 'Invalid utf8 input detected'. It’s strange, because the page is not in
>>> UTF-8. The head contains: <meta content="text/html; charset=ISO-8859-1"
>>> http-equiv="Content-Type">. I have tried to find how to specify the
>>> character set in reading files with Zinc, but without success.*
>>>
>>> If it’s relevant, I am using Pharo4.0 Latest update: #40613, downloaded two
>>> days ago. The address of the web page is:
>>> http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk.
>>> Other pages from the same source are loaded and analysed with no problem.
>>> Processing this page seems to go off course as soon as it encounters the
>>> character code 246, which is a correct o-umlaut in ISO-8859-1.
>>>
>>> Any advice gratefully received.
>>>
>>> Peter Kenny
>>>
>>> *I would be happy with advice to RTFM, if someone would point out the
>>> relevant bit of the FM.
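
The point made in the thread, that system code can only signal an exception and the calling code must decide what to do, can be sketched roughly as below. This is only an illustration under assumptions: decodeWithFallback is a hypothetical block name, not part of Zinc; catching the generic Error class is a deliberate simplification so that no specific Zinc exception class has to be named; and the choice of ISO-8859-1 as the fallback is a guess that happens to suit this page, not a general rule. The bytes are the second specimen array quoted earlier in the thread.

```smalltalk
"Sketch only. decodeWithFallback is a hypothetical name, not a Zinc API."
| bytes decodeWithFallback |

"Second specimen byte array from the thread (contains bytes 246 and 228)."
bytes := #[73 83 79 56 56 53 57 49 58 32 68 101 114 32 87 101 103 32 122
	117 114 32 72 246 108 108 101 32 105 115 116 32 109 105 116 32 103 117
	116 101 110 32 86 111 114 115 228 116 122 101 110 32 103 101 112 102
	108 97 115 116 101 114 116 46].

"Try strict UTF-8 first. If decoding signals an error, answer the
 ISO-8859-1 decoding of the same bytes instead.
 Assumption: Error is caught generically rather than a specific class."
decodeWithFallback := [ :someBytes |
	[ ZnUTF8Encoder new decodeBytes: someBytes ]
		on: Error
		do: [ :ex |
			ex return: (ZnByteEncoder iso88591 decodeBytes: someBytes) ] ].

decodeWithFallback value: bytes.
```

Evaluated in a playground, the last expression should answer a String either way. The same block could be applied to bytes fetched with ZnClient or read through binaryReadStreamDo:, with the result passed to Soup fromString:. As noted in the thread, trying encodings in turn is a heuristic: bytes that happen to be valid UTF-8 will never reach the fallback, even if the server meant something else.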