[Pharo-users] Re: How to handle (recover) from a ZnInvalidUTF8: Illegal continuation byte for utf-8 encoding error?

Tim Mackinnon Tue, 20 Jul 2021 06:48:15 -0700

Hey, thanks guys - so looking at readStreamEncoded: - how do I know what the 
valid encodings are? Skimming those docs Sven referenced, I can start to pick 
out some - but is there a list? I see that the method parameter says “anEncoding”, 
but the type hint on that is misleading, as it seems like it’s a String - or is 
it a Symbol? If I search for Encoder classes - I do find ZnCharacterEncoder - and 
it has class methods for latin1, utf8, ascii - so is this the definitive list? 
And should the encoding strings used in those methods be constants or something 
I can reference in my code?
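
A minimal sketch of how the list question could be explored in a playground. 
It assumes a recent Zinc where ZnCharacterEncoder answers 
#knownEncodingIdentifiers and #newForEncoding: on the class side; check your 
own image, as neither selector is confirmed anywhere in this thread.

```smalltalk
"Assumption: #knownEncodingIdentifiers exists in this image.
 Inspect the result to see the identifier strings Zinc recognises."
ZnCharacterEncoder knownEncodingIdentifiers.

"Assumption: #newForEncoding: resolves an identifier string to an encoder."
ZnCharacterEncoder newForEncoding: 'iso-8859-1'.

"The class-side shortcut mentioned above, for one common encoding."
ZnCharacterEncoder latin1.
```

Inspecting each expression shows which encoder class sits behind an 
identifier.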


Gosh - this raises a whole host of things I just naively assumed happened for 
me.

So it looks like the file giving me issues - seems to have characters like £ or 
¬ in it. So I’m wondering how I know what the proper encoding format would be 
(I think these files were written out with some PHP app) - is it just a trial 
and error thing?
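
The trial-and-error idea could be sketched roughly like this: read the raw 
bytes once, then see which candidate encoders decode them without 
complaining. The candidate list is purely illustrative, 'cp1252' is assumed 
to be an identifier Zinc knows, and #newForEncoding: and #decodeBytes: are 
assumed to be available in the image. firmEfs is the file reference from my 
original code.

```smalltalk
| bytes candidates usable |
"Read the whole file as bytes, without any decoding."
bytes := firmEfs binaryReadStreamDo: [ :in | in upToEnd ].

"Hypothetical candidate list; adjust to whatever the producing app might use."
candidates := #( 'utf-8' 'iso-8859-1' 'cp1252' ).

"Keep the encodings whose strict decoder accepts every byte."
usable := candidates select: [ :each |
	[ (ZnCharacterEncoder newForEncoding: each) decodeBytes: bytes.
	true ]
		on: ZnCharacterEncodingError
		do: [ :error | false ] ].
usable
```

An encoding that decodes without error is only a candidate, not proof: 
several byte encodings will accept the same bytes and still produce 
different characters, so the output needs eyeballing.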

I tried changing my code to:

details parseStream: (firmEfs readStreamEncoded: 'iso-8859-1').

and other variants like 'ASCII' and 'latin1' - and this then gives me another 
error: "ZnCharacterEncodingError: Character Unicode code point outside encoder 
range"

So it does sound like I have a file that isn’t conforming to known standards - 
and I guess I have to use #beLenient option.

Sven - In the examples for using #beLenient - you seem to show something that 
assumes you will iterate with Do - as my existing code takes a stream, that it 
wants to do a #nextLine on - would it be bad to do something like this:

efsStream := (firmEfs readStreamEncoded: 'latin1').
efsStream encoder beLenient.

details parseStream: efsStream.

That is - get the encoder from my Stream and make it lenient?
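
For comparison, a sketch of the same idea written in the style of Sven's 
examples, handing a lenient encoder to the stream up front and then reading 
with #nextLine. It assumes ZnCharacterReadStream understands #nextLine and 
answers nil at the end, as my existing loop already relies on. The 
Transcript output is only a placeholder for the real parsing.

```smalltalk
firmEfs binaryReadStreamDo: [ :in |
	| efsStream line |
	"Lenient latin1 decoder: bytes outside the mapping are let through
	 instead of raising ZnCharacterEncodingError."
	efsStream := ZnCharacterReadStream
		on: in
		encoding: ZnCharacterEncoder latin1 beLenient.
	"Same loop shape as parseStream:with: - nextLine answers nil at the end."
	[ (line := efsStream nextLine) isNil ]
		whileFalse: [ Transcript show: line; cr ] ]
```

Using binaryReadStreamDo: also closes the underlying file when the block 
finishes, which the plain readStreamEncoded: variant leaves to the caller.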
         
Appreciate the pointers on this guys - I’m definitely learning something new 
here.

Tim

> On 20 Jul 2021, at 12:11, Guillermo Polito <guillermopol...@gmail.com> wrote:
> 
> 
> 
>> On 20 Jul 2021, at 11:45, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>> 
>> 
>> 
>>> On 20 Jul 2021, at 11:03, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>> 
>>> Hi Tim,
>>> 
>>> An introduction to this part of the system is in the chapter "Character 
>>> Encoding and Resource Meta Description" from the "Enterprise Pharo" book:
>>> https://ci.inria.fr/pharo-contribution/job/EnterprisePharoBook/lastSuccessfulBuild/artifact/book-result/Zinc-Encoding-Meta/Zinc-Encoding-Meta.html
>>> 
>>> The error means that a file that you try to read as UTF-8 does contain 
>>> things that are invalid with respect to the UTF-8 standard.
>>> 
>>> Are you sure the file is in UTF-8, maybe it is in ASCII, Latin-1 or 
>>> something else ?
>>> 
>>> It is possible to customise the encoding to something different than the 
>>> default UTF-8. For non-UTF encoders, there is a strict/lenient option to 
>>> disallow/allow illegal stuff (but then you will get these in your strings).
>>> 
>>> I can show you how to do that if you want.
>> 
>> '/var/log/system.log' asFileReference readStreamDo: [ :in | in upToEnd ].
>> 
>> '/var/log/system.log' asFileReference binaryReadStreamDo: [ :in |
>>      (ZnCharacterReadStream on: in encoding: #ascii) upToEnd ].
>> 
>> '/var/log/system.log' asFileReference binaryReadStreamDo: [ :in |
>>      (ZnCharacterReadStream on: in encoding: ZnCharacterEncoder ascii beLenient) upToEnd ].
> 
> There is also readStreamEncoded:[do:], which is a bit more concise but does 
> the same :)
> 
>> 
>> HTH
>> 
>>> Sven
>>> 
>>>> On 20 Jul 2021, at 10:31, Tim Mackinnon <tim@testit.works> wrote:
>>>> 
>>>> Hi - I’m doing a bit of log file processing with Pharo - and I’ve hit an 
>>>> unexpected error and am wondering what the best way to approach it is.
>>>> 
>>>> It seems that I have a log file that has unexpected characters, and so my 
>>>> readStream loop that reads lines gets an error: "ZnInvalidUTF8: Illegal 
>>>> continuation byte for utf-8 encoding”.
>>>> 
>>>> For some reason this file (unlike my others) seems to contain characters 
>>>> that it shouldn’t - but what is the best way for me to continue 
>>>> processing? Should I be opening my files in a different way - or can I 
>>>> resume the error somehow- I’m not familiar with this area of Pharo and am 
>>>> after a bit of advice.
>>>> 
>>>> My code is like this (and I get the error when doing nextLine)
>>>> 
>>>> 
>>>> parseStream: aFileStream with: aBlock
>>>>    | line items |
>>>>    [ (line := aFileStream nextLine) isNil ]
>>>>            whileFalse: [ 
>>>>                    items := $/ split: line.
>>>>                    items size = 3 ifTrue: [aBlock value: items]]
>>>> 
>>>> My stream is created like this:
>>>> 
>>>> firmEfs := (pathName , '/' , firmName , '_files') asFileReference.
>>>> details parseStream: firmEfs readStream.
>>>> 
>>>> 
>>>> Should I be opening the stream a bit differently - or can I catch that 
>>>> encoding error and resume it with some safe character?
>>>> 
>>>> Thanks for any help.
>>>> 
>>>> Tim
