Status: New
Owner: ----

New issue 3360 by sven.van.caekenberghe: TextConverter handling of binary streams is wrong
http://code.google.com/p/pharo/issues/detail?id=3360

It seems that the way binary (#isBinary true) streams are handled by TextConverter and its subclasses is wrong. When given a binary stream, the core text converter methods (#nextPut:toStream: and #nextFromStream:) simply no longer encode or decode at all.

Moreover, the unit test UTF8TextConverter>>#testPutSingleCharacter seems plain wrong. The actual encoded bytes should be #[97 226 130 172].
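
As a cross-check on those expected bytes: 97 is the code of $a, and 226 130 172 is the three-byte UTF-8 form of U+20AC (the euro sign). The workspace sketch below redoes that arithmetic by hand. It assumes the test input is $a followed by U+20AC, which is what the bytes decode to; the variable names are illustrative only.

```smalltalk
| codePoint bytes |
codePoint := 16r20AC.
"Three-byte UTF-8 layout: 1110xxxx 10xxxxxx 10xxxxxx"
bytes := ByteArray
	with: 16rE0 + (codePoint bitShift: -12)
	with: 16r80 + ((codePoint bitShift: -6) bitAnd: 16r3F)
	with: 16r80 + (codePoint bitAnd: 16r3F).
"bytes should now hold 16rE2 16r82 16rAC, that is 226 130 172"
```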

However, this behavior seems to have been added by design, so it is hard to estimate the impact of changing it.

It is currently very ugly to get a binary UTF-8 encoding: one has to write to a character stream and then turn those characters into bytes.
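
A minimal workspace sketch of that two-step workaround might look as follows. It assumes UTF8TextConverter still encodes normally when handed a non-binary (character) stream, and that the resulting characters all have values below 256 so that #asByteArray applies; the variable names are illustrative only.

```smalltalk
| converter string encoded bytes |
converter := UTF8TextConverter new.
string := String with: $a with: (Unicode value: 16r20AC).
"Step 1: encode into a character (non-binary) stream"
encoded := String streamContents: [ :stream |
	string do: [ :each |
		converter nextPut: each toStream: stream ] ].
"Step 2: turn the resulting characters into bytes"
bytes := encoded asByteArray.
```

If the converter behaves as assumed, bytes should equal #[97 226 130 172].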

I wrote an alternative UTF-8 encoder as a support class to the Zinc HTTP Components (http://www.squeaksource.com/ZincHTTPComponents.html) together with the following unit test:

testUTF8Encoder
        "The examples are taken from http://en.wikipedia.org/wiki/UTF-8#Description"
        
        | encoder inputBytes outputBytes inputString outputString |
        encoder := ZnUTF8Encoder new.
        inputString := String with: $$ with: (Unicode value: 16r00A2) with: (Unicode value: 16r20AC) with: (Unicode value: 16r024B62).
        inputBytes := #[16r24 16rC2 16rA2 16rE2 16r82 16rAC 16rF0 16rA4 16rAD 16rA2].
        outputBytes := self encodeString: inputString with: encoder.
        self assert: outputBytes = inputBytes.
        outputString := self decodeBytes: inputBytes with: encoder.
        self assert: outputString = inputString

based on the helper methods:

encodeString: string with: encoder
        ^ ByteArray streamContents: [ :stream |
                string do: [ :each |
                        encoder nextPut: each toStream: stream ] ]

decodeBytes: bytes with: encoder
        | input |
        input := bytes readStream.
        ^ String streamContents: [ :stream |
                [ input atEnd ] whileFalse: [
                        stream nextPut: (encoder nextFromStream: input) ] ]

The new encoder code is simpler, but it might not handle everything that is needed (leading chars, language codes). Is all of that still needed?

Sven