Re: [Pharo-project] UTF converter fails to read tildes and ñ

Sven Van Caekenberghe Tue, 05 Jul 2011 04:35:03 -0700

On 04 Jul 2011, at 20:44, Guillermo Polito wrote:

> Actually, this testcase may be wrong, but reproduces the problem when doing a 
> fileout with tildes and ñ's in packages or authors...
> 
> BTW, it does not fail on the assert, it raises an exception when sending the 
> #nextFromStream:  :S


Well, this is a correct test case:

| converter string |
converter := UTF8TextConverter new.
string := String streamContents: [ :stream |
    converter nextPut: $a toStream: stream ].
$a = (converter nextFromStream: string readStream).

| converter string character |
converter := UTF8TextConverter new.
"lowercase n with diacritical tilde, in HTML &ntilde;"
character := Character value: 241.
string := String streamContents: [ :stream |
    converter nextPut: character toStream: stream ].
character = (converter nextFromStream: string readStream).

The silly/stupid thing with the TextConverter hierarchy is that it encodes 
characters onto a character stream that it treats as a binary stream (i.e. a 
byte with value 200 decimal is stored as a Character with value 200). To 
decode, it needs a character stream but treats it as if it contained bytes!

If you replace the String with a ByteArray in the above, it completely fails 
to act as expected. Look into the code and you'll be surprised.
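
To make the point concrete, here is a small sketch of what the String in the 
second test case ends up holding, assuming the behaviour described above (each 
encoded byte is stored as a Character with the same value). The UTF-8 encoding 
of code point 241 is the two bytes 195 and 177; the variable names are only 
for illustration.

```smalltalk
| converter string |
converter := UTF8TextConverter new.
string := String streamContents: [ :stream |
    converter nextPut: (Character value: 241) toStream: stream ].
"One character went in, but the String now holds two Characters,
 one per UTF-8 byte (assuming the behaviour described above)."
string size.        "expected: 2"
string asByteArray. "expected: #[195 177]"
```

So the result looks like a String but is really a byte sequence in disguise, 
which is why the decoder has to be fed a character stream again.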

On the other hand, the ZnCharacterEncoder hierarchy acts as a real (and 
simpler) encoder/decoder from characters to bytes and vice versa:

| converter bytes character |
converter := ZnUTF8Encoder new.
"lowercase n with diacritical tilde, in HTML &ntilde;"
character := Character value: 241.
bytes := ByteArray streamContents: [ :stream |
    converter nextPut: character toStream: stream ].
character = (converter nextFromStream: bytes readStream).

BTW, the main reason for introducing the ZnCharacterEncoder hierarchy was 
that I needed a way to compute how many bytes a string would need once 
encoded, before actually encoding it (see #encodedByteCountFor:). That is a 
nontrivial operation for a variable-length encoding like UTF-8. The messed-up 
API was another reason.
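
As a rough sketch of that byte counting, the snippet below sums the per-character 
counts over a string. It assumes that #encodedByteCountFor: takes a single 
Character and answers the number of bytes its encoding needs; the sample 
string and variable names are made up for the example.

```smalltalk
| encoder string count |
encoder := ZnUTF8Encoder new.
"'manana' with an n-tilde (code point 241) as the third character"
string := 'ma', (String with: (Character value: 241)), 'ana'.
"Assumption: #encodedByteCountFor: answers the byte count for one Character."
count := string
    inject: 0
    into: [ :sum :each | sum + (encoder encodedByteCountFor: each) ].
count. "expected: 7, i.e. 5 one-byte characters plus 2 bytes for code point 241"
```

The count is known without producing any bytes, which is useful when a length 
has to be written out before the encoded content itself.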

Sven

PS: I have *not* said that there is an encoding fault in UTF8TextConverter, 
just that the API is freaky.
