Re: [Pharo-project] String input not in UTF-8

Hilaire Fernandes Fri, 05 Aug 2011 07:42:31 -0700

I had a look at the latest XMLParser, but the API is different and a
lot of my code now breaks. Is XMLWriter class>>on: obsolete? It
complains about that, but the class and method are still there. Is this
a Monticello trick I have forgotten about?
I don't even know how to port to the new API. Is there a porting guide?
I guess this is for the better, but it is still frustrating and
distracting from the main task...




Le 05/08/2011 16:23, Henrik Johansen a écrit :
> 
> On Aug 5, 2011, at 3:41 54PM, Hilaire Fernandes wrote:
> 
>> Le 05/08/2011 13:28, Henrik Johansen a écrit :
>>>
>>> On Aug 5, 2011, at 1:14 35PM, Hilaire Fernandes wrote:
>>>
>>>> It seems that when inputting accented characters, they are not in
>>>> UTF-8 by default.
>>>> Is that the case with Pharo 1.3?
>>>>
>>>> Hilaire
>>>>
>>>>
>>>> -- 
>>>> Education 0.2 -- http://blog.ofset.org/hilaire
>>>
>>> I'm not sure what you mean.
>>> When in image, all the way from InputEvents to String representation, you 
>>> only deal with Unicode codePoints.
>>
>> It seems they are 8-bit chars; when exported through XMLParser, the
>> result is an 8-bit string. I need to investigate further.
>>
>> Hilaire
> It is an 8-bit character, since the codePoint fits in one byte. (see a)
> Accented characters like é could be either:
> a) One Unicode codepoint (U+00E9 (decimal 233), small e with acute)
> b) Two Unicode codepoints (U+0065 (decimal 101), small e, followed by 
> U+0301 (decimal 769), combining acute accent).
> 
> Internally, you'd see strings with character values corresponding to those 
> listed as decimal, i.e. the Unicode codePoints.
> b) would be a WideString, as 769 does not fit in a byte.
> 
> However, if correctly converted to UTF-8, their representations should be:
> a) represented in 2 bytes: 16r C3A9
> b) represented in 3 bytes: 16r 65 CC81.
> 
> I.e., it seems XMLParser does not encode it properly to UTF-8 when exporting.
> Note: This is perfectly legal if the document contains an encoding attribute 
> specifying a one-byte encoding like iso-8859-1 or windows-1252.
> (starts with <?xml version="1.0" encoding="windows-1252" ?> or some such)
> Absent such an attribute, or a BOM indicating another Unicode encoding 
> though, it is a bug.
> 
> Cheers,
> Henry
> 
> 
> 
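The two forms of é and their byte representations discussed above can be checked with a short sketch. It is written in Python rather than Pharo purely as a neutral illustration, and the helper name `describe` is made up for this example. It says nothing about how XMLParser or XMLWriter actually encode their output.

```python
import unicodedata


def describe(text):
    """Return the code points and the UTF-8 bytes (as hex) of a string."""
    return {
        "code_points": [ord(ch) for ch in text],
        # True when every code point fits in one byte
        # (roughly a ByteString rather than a WideString in Pharo terms).
        "fits_in_one_byte": all(ord(ch) < 256 for ch in text),
        "utf8_hex": text.encode("utf-8").hex(),
    }


precomposed = "\u00e9"   # a) one code point: small e with acute
decomposed = "e\u0301"   # b) small e followed by combining acute accent

info_a = describe(precomposed)
info_b = describe(decomposed)

# What a one-byte encoding such as ISO-8859-1 produces for form a).
latin1_hex = precomposed.encode("latin-1").hex()

# Both forms are canonically equivalent: NFC composes b) into a).
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(info_a)
print(info_b)
print(latin1_hex, same_after_nfc)
```

A lone byte E9 is what ISO-8859-1 gives for form a); on its own it is not valid UTF-8, so a document containing it needs an encoding declaration saying so.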


-- 
Education 0.2 -- http://blog.ofset.org/hilaire
