Philippe,
I would even go further, the test
UTF8TextConverterTest>>#testPutSingleCharacter is plain wrong, it seems to have
been written as an afterthought:
testPutSingleCharacter
| actual |
actual := ByteArray streamContents: [ :stream |
| converter |
converter := UTF8TextConverter new.
converter
nextPut: $a
toStream: stream.
converter
nextPut: (Unicode value: 16r20AC)
toStream: stream ].
self assert: actual = #[97 0 0 32 172]
The correct result is #[97 226 130 172] !!
If you would give these bytes to any other system, they would never be able to
decode them as UTF-8.
Sven
On 30 Nov 2010, at 08:46, Philippe Marschall wrote:
> On 11/29/2010 03:23 PM, Sven Van Caekenberghe wrote:
>> Hi,
>>
>> TextConverter and its subclasses seem to break the contract of
>> #nextFromStream: and #nextPut:toStream: when the stream #isBinary. Consider
>> the following two examples:
>>
>> ByteArray streamContents: [ :stream | | encoder |
>> encoder := UTF8TextConverter new.
>> 'élève en français' do: [ :each | encoder nextPut: each toStream:
>> stream ] ].
>>
>> #[233 108 232 118 101 32 101 110 32 102 114 97 110 231 97 105 115]
>>
>> (String streamContents: [ :stream | | encoder |
>> encoder := UTF8TextConverter new.
>> 'élève en français' do: [ :each | encoder nextPut: each toStream:
>> stream ] ]) asByteArray.
>>
>> #[195 169 108 195 168 118 101 32 101 110 32 102 114 97 110 195 167 97 105
>> 115]
>>
>> The first answer is incorrect, the second is correct (as far as I understand
>> it).
>
> I would agree.
>
>> This is apparently on purpose, from the implementation of, for example,
>> UTF8TextConverter>>#nextPut:toStream:
>>
>> nextPut: aCharacter toStream: aStream
>> | leadingChar nBytes mask shift ucs2code |
>> aStream isBinary ifTrue: [^aCharacter storeBinaryOn: aStream].
>> leadingChar := aCharacter leadingChar.
>> (leadingChar = 0 and: [aCharacter asciiValue < 128]) ifTrue: [
>> aStream basicNextPut: aCharacter.
>> ^ aStream.
>> ].
>>
>> "leadingChar > 3 ifTrue: [^ aStream]."
>>
>> ucs2code := aCharacter asUnicode.
>> ucs2code ifNil: [^ aStream].
>>
>> nBytes := ucs2code highBit + 3 // 5.
>> mask := #(128 192 224 240 248 252 254 255) at: nBytes.
>> shift := nBytes - 1 * -6.
>> aStream basicNextPut: (Character value: (ucs2code bitShift: shift) +
>> mask).
>> 2 to: nBytes do: [:i |
>> shift := shift + 6.
>> aStream basicNextPut: (Character value: ((ucs2code bitShift:
>> shift) bitAnd: 63) + 128).
>> ].
>>
>> ^ aStream.
>>
>> I would say that the contract of #nextPut:toStream: is to take a Character
>> object and write a binary representation using a specific encoding to a
>> stream. However, when given a #isBinary stream, it does no longer do any
>> encoding at all !
>
> There are many cases where Squeak/Pharo do silent coercion, where
> obviously something went wrong (you pass a ByteArray where a String is
> expected or vice versa) and the system tries to be clever about it and
> do something else than raising an exception and forcing you to fix your
> code. The result is often wrong because the system can not guess what
> you wanted to do.
>
>> The same is true for the other converters as well as for #nextFromStream:.
>>
>> Does anyone know why that is the case ?
>>
>> And if it is by design, how should one do UTF8 encoding on a binary stream ??
>>
>> Thx,
>>
>> Sven
>>
>> PS: If others also think this is strange, I could make an issue, I am just
>> not sure this is a bug.
>
> I find that strange as well.
>
> Cheers
> Philippe