Hi Peter,

> On 09 May 2015, at 19:00, PBKResearch <pe...@pbkresearch.co.uk> wrote:
> 
> Sven
> 
> Thanks for your considered response to my midnight thoughts. I now see the 
> importance of distinguishing the html side (Soup in my work) from the http 
> side (Zinc). I looked at your specimen byte arrays in a Pharo playground, 
> which I am still learning to use; I hadn't seen that proverb in German, but I 
> see the point you are making.

Haha, good !

> Now that my immediate problem is solved, I am not sure whether it is 
> necessary to take up your time any more with it. It depends how many servers 
> do not say what encoding they use, how many sites use encodings other than 
> UTF-8 and how big is the intersection of those sets. If my problem was a 
> one-off, there is no need to go any further.

It happens, but not too often (anymore).

> However, there is one general point I would like to make. The debugger is a 
> very good tool for programmers investigating their own code, but it is a very 
> unfriendly place for a user dumped in the middle of someone else's code; 
> yesterday I learned more than I ever wished to about the innards of UTF-8 
> decoding, before realising it was all irrelevant. If you do think it worth 
> modifying the handling of this case, I would suggest replacing the call to 
> the debugger with a dialog box of some sort, perhaps with debug as an option 
> for the enthusiast, but perhaps also with an option to restart with an 
> alternative encoding.

I understand your idea, but showing dialogs from system code is a no-go. All 
that we can do is throw better or more specific exceptions; it is up to the 
code invoking things (your code/application) to handle those.
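
As a rough sketch of what that handling could look like in your own code: 
this assumes the failure surfaces as a ZnCharacterEncodingError (or one of 
its subclasses) and uses a placeholder URL, so treat it as an illustration 
rather than a recipe.

```smalltalk
| url |
"Placeholder URL, substitute the page you are actually fetching"
url := 'http://www.example.com/page.html'.
[ ZnClient new get: url ]
  on: ZnCharacterEncodingError
  do: [ :error |
    "Decoding with the default (UTF-8) failed, retry assuming ISO-8859-1"
    ZnDefaultCharacterEncoder
      value: ZnByteEncoder iso88591
      during: [ ZnClient new get: url ] ]
```

Your application could just as well open its own dialog inside the handler 
block; that decision belongs there, not in Zn.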

> If I can, I would like to ask a supplementary question - I am deeply ignorant 
> but eager to learn. As I mentioned, I tried to get round the problem by 
> downloading the page source to a local file and reading from there into Soup, 
> but this also involved the Zinc decoder and so failed. I tried to see how to 
> get round this, using what I now know about the encoding, and came up with 
> the following:
> 
> binaryStream := (FileStream readOnlyFileNamed: 'display.html') binary.
> charStream := ZnCharacterReadStream
>   on: binaryStream
>   encoding: ZnByteEncoder iso88591.
> hbSoup := Soup fromString: charStream contents.

Yes, that is perfect (did you see 
http://stfx.eu/EnterprisePharo/Zinc-Encoding-Meta/ ?)

I would write

'display.html' asFileReference binaryReadStreamDo: [ :in |
  Soup fromString:
    (ZnCharacterReadStream on: in encoding: ZnByteEncoder iso88591) upToEnd ].

or even

'display.html' asFileReference binaryReadStreamDo: [ :in |
  Soup fromString: (ZnByteEncoder iso88591 decodeBytes: in upToEnd) ].

It seems Soup does not accept streams, only strings.

> This worked, so in that sense it is OK, but I wonder if there is a neater way 
> of doing it. More importantly, I found that Soup has its own decoder, so I 
> can skip the second line and replace the third by:
> 
> hbSoup := Soup fromString: binaryStream contents asString.
> 
> At one stage I found myself looking at a debugger on this process (I know - 
> this contradicts what I said above!), because I had not realised that 
> 'asString' was needed. It looked as though Soup was trying three candidate 
> encodings, which it labelled 'latin1', 'utf-8' and 'cp1252', to find which 
> one would work. It showed the one it had 'sniffed' as most likely being 
> 'latin1', which I think is the same as ISO-8859-1, so it was trying that 
> first.

Yes, Latin1, cp1252 and ISO88591 are equivalent for most purposes. BTW, 
#asString is also more or less the same (the difference is that there is a 
'hole' in the encoding).
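
A small sketch to illustrate, using bytes that stay clear of that hole (the 
byte values are my own example, they spell 'Hölle' in ISO-8859-1):

```smalltalk
| bytes |
"72 = H, 246 = o-umlaut, 108 = l, 101 = e"
bytes := #[72 246 108 108 101].
"Compare the explicit decoder with the plain byte-to-character conversion"
(ZnByteEncoder iso88591 decodeBytes: bytes) = bytes asString
```

I would expect this to answer true, since both approaches map byte 246 to the 
character with code point 246.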
> 
> Given this, my question is whether Zinc would allow me to read from a web URL 
> as a binary stream, which I could then feed into the Soup decoder in the same 
> way. If I can, I would use this as my standard procedure; I expect to be 
> visiting a lot of sites, and it would be handy to be able to ignore the 
> encoding issue and hope that Soup can sort it out.

Downloading binary to a file goes like this:

ZnClient new 
  url: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk';
  downloadTo: 'display.html'.

If you would want the bytes in memory, you could do:

| client bytes |
(client := ZnClient new) 
   streaming: true;
   get: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk'.
bytes := client entity contents.
client close.
bytes.
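
Once you have the bytes, a simple way to not care about the encoding up front 
is to try UTF-8 first and fall back to ISO-8859-1 when that fails. This is 
only a sketch: the literal byte array stands in for the downloaded bytes, and 
I am assuming the strict UTF-8 decoder signals a ZnCharacterEncodingError (or 
a subclass) on invalid input.

```smalltalk
| bytes html |
"Stand-in for the downloaded bytes: 'Hölle' encoded as ISO-8859-1"
bytes := #[72 246 108 108 101].
html := [ ZnUTF8Encoder new decodeBytes: bytes ]
  on: ZnCharacterEncodingError
  do: [ :error |
    "Byte 246 is not valid UTF-8 here, so fall back to ISO-8859-1"
    ZnByteEncoder iso88591 decodeBytes: bytes ].
html
```

The resulting string can then go into Soup fromString: as before. Keep in 
mind that the ISO-8859-1 fallback will accept almost any input, so it hides 
the error rather than proving the guess was right.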

> Finally, a general comment. Both this query and the one I posed earlier this 
> week, answered by Vincent Blondeau, showed that Pharo users can come to this 
> site and expect quick, friendly and expert help. I am retired and can devote 
> all the time I want to this, but you people must have day jobs as well! I am 
> really very grateful.

To get anywhere, we have to help each other.

Sven

> Best wishes
> 
> Peter Kenny
> 
> -----Original Message-----
> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of 
> Sven Van Caekenberghe
> Sent: 09 May 2015 07:51
> To: Any question about pharo is welcome
> Subject: Re: [Pharo-users] Problem using Zinc in Pharo 4 (Moose 5.1)
> 
> 
>> On 09 May 2015, at 02:18, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>> 
>> Sven
>> 
>> Many thanks for the quick response. I always like to try to solve problems 
>> myself before appealing for help, so I had worked out what was wrong, but 
>> did not know how to tell Zinc to use a specific coding. I had tried by 
>> reading through your very full note on Zinc, but did not find the trick you 
>> describe - which works perfectly, of course.
> 
> Good, yes this is a more recent thing.
> 
>> It seems unfortunate that Zinc does not use the coding specified in the html 
>> head. Evidently browsers like Firefox must do it, since the page displays 
>> correctly. If it cannot be done, I think it would be helpful to reconsider 
>> the error message produced when the user is dumped out, because in this 
>> context it is misleading. I spent some time tracing debugger output, trying 
>> to work out what was wrong with the UTF-8, before I spotted that one of the 
>> bytes was displayed in character form as $ö, and began to suspect it might 
>> be a different coding; I finally confirmed this by reading the page source 
>> in Firefox.
> 
> Zn deals with HTTP, not with HTML; these are totally different things, and a 
> browser obviously does both. But even then there is no easy way to do this, 
> apart from trying. Consider these two byte arrays:
> 
> #[85 84 70 56 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72 195 182 108 
> 108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86 111 114 
> 115 195 164 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116 46]
> 
> #[73 83 79 56 56 53 57 49 58 32 68 101 114 32 87 101 103 32 122 117 114 32 72 
> 246 108 108 101 32 105 115 116 32 109 105 116 32 103 117 116 101 110 32 86 
> 111 114 115 228 116 122 101 110 32 103 101 112 102 108 97 115 116 101 114 116 
> 46]
> 
> In them it says how you should decode them! 
> 
> The GT tools make this challenge easy because there is a tab that tries both 
> encodings, but in general this is hard to solve (efficiently).
> 
> But since Zn does not do HTML, it will never be added at that level.
> 
> I will think about the error, it might indeed be useful to tell the user that 
> a default encoding was chosen.
> 
>> Thanks again for your help.
> 
> You're welcome.
> 
>> Peter Kenny
>> 
>> -----Original Message-----
>> From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of 
>> Sven Van Caekenberghe
>> Sent: 08 May 2015 20:04
>> To: Any question about pharo is welcome
>> Subject: Re: [Pharo-users] Problem using Zinc in Pharo 4 (Moose 5.1)
>> 
>> Peter,
>> 
>> Thanks for the URL, it makes it much easier to help you.
>> 
>> The answer is easy: the server is incorrect, it serves a specific encoding 
>> without saying so.
>> 
>> Consider:
>> 
>> (ZnClient new 
>>  head: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk';
>>  response) contentType.
>> 
>> => 'text/html'
>> 
>> If no charset/encoding is specified, the modern default is UTF-8, so Zn 
>> tries that but fails.
>> 
>> You can change the default for unspecified encoding as follows:
>> 
>> ZnDefaultCharacterEncoder 
>>   value: ZnByteEncoder iso88591
>>   during: [ 
>>     ZnClient new 
>>       get: 'http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk' ].
>> 
>> The server should have used the following mime type to avoid the confusion:
>> 
>> ZnMimeType textHtml charSet: #iso88591
>> 
>> => 'text/html;charset=iso88591'
>> 
>> HTH,
>> 
>> Sven
>> 
>> PS: the encoding inside the document cannot be used because (1) no 
>> interpretation inside documents is done and (2) at that point it is too 
>> late, the contents are already converted from bytes to characters
>> 
>>> On 08 May 2015, at 18:51, PBKResearch <pe...@pbkresearch.co.uk> wrote:
>>> 
>>> Hello
>>> 
>>> I have been trying to use Soup class>> fromUrl: to access the contents of a 
>>> web page. It halts with a message from Zinc about malformed UTF-8. The page 
>>> displays perfectly in Firefox, so I copied the page source from there to a 
>>> local file and tried to read it from there. Again a message from Zinc: 
>>> 'Invalid utf8 input detected'. It’s strange, because the page is not in 
>>> UTF-8. The head contains: <meta content="text/html; charset=ISO-8859-1" 
>>> http-equiv="Content-Type">. I have tried to find how to specify the 
>>> character set in reading files with Zinc, but without success.*
>>> 
>>> If it’s relevant, I am using Pharo4.0 Latest update: #40613, downloaded two 
>>> days ago. The address of the web page is: 
>>> http://kompakt.handelsblatt-service.com/ff/display.php?msgID=725164109&adr=pe...@pbkresearch.co.uk.
>>>  Other pages from the same source are loaded and analysed with no problem. 
>>> Processing this page seems to go off course as soon as it encounters the 
>>> character code 246, which is a correct o-umlaut in ISO-8859-1.
>>> 
>>> Any advice gratefully received.
>>> 
>>> Peter Kenny
>>> 
>>> *I would be happy with advice to RTFM, if someone would point out the 
>>> relevant bit of the FM.
>> 
>> 
>> 
> 
> 
> 

