Hi Guille,

> On Jan 18, 2019, at 6:04 AM, Guillermo Polito <[email protected]>
> wrote:
>
>> On Fri, Jan 18, 2019 at 2:46 PM Ben Coman <[email protected]> wrote:
>>
>>> On Fri, 18 Jan 2019 at 21:39, Sven Van Caekenberghe <[email protected]> wrote:
>>>
>>> > On 18 Jan 2019, at 14:23, Guillermo Polito <[email protected]>
>>> > wrote:
>>> >
>>> >
>>> > I think that will just overcomplicate things. Right now, all Strings in
>>> > Pharo are unicode strings.
>>
>> Cool. I didn't realise that. But to be pedantic, which unicode encoding?
>> Should I presume from Sven's "UTF-8 encoding step" comment below
>> and the WideString class comment "This class represents the array of 32 bit
>> wide characters"
>> that the WideString encoding is UTF-32? So should its comment be updated to
>> advise that?
>
> None :D
>
> That's the funny thing, they are not encoded.
>
> Actually, you should see Strings as collections of Characters, and Characters
> defined in terms of their abstract code points.
> ByteStrings are an optimized (just more compact) version that stores
> codepoints that fit in a byte.

And Spur supports 16-bit strings too, which would be versions that store code points that fit in doublebytes.

>> cheers -ben
>>
>>> Characters are represented with their corresponding unicode codepoint.
>>> > If all characters in a string have codepoints < 256 then they are just
>>> > stored in a bytestring. Otherwise they are WideStrings.
>>> >
>>> > I think assuming a single representation for strings, and then encode
>>> > when interacting with external apps/APIs is MUCH simpler.
>>>
>>> Absolutely !
>>>
>>> (and yes I know that for outgoing FFI calls that might mean a UTF-8
>>> encoding step, so be it).
>
>
> --
>
> Guille Polito
> Research Engineer
> Centre de Recherche en Informatique, Signal et Automatique de Lille
> CRIStAL - UMR 9189
> French National Center for Scientific Research - http://www.cnrs.fr
>
> Web: http://guillep.github.io
> Phone: +33 06 52 70 66 13
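
To make the "code points, not an encoding" idea concrete, here is a minimal sketch in Python rather than Pharo. It is not how Pharo or the Spur VM implements strings: the function names (element_width, store, load) and the little-endian layout are hypothetical choices made only for illustration. It treats a string as a sequence of code points and picks the narrowest fixed width (1, 2 or 4 bytes per character) that fits all of them, loosely mirroring the ByteString, 16-bit and WideString cases discussed in this thread.

```python
from typing import Tuple


def element_width(text: str) -> int:
    """Return the smallest fixed width in bytes (1, 2 or 4) that can
    hold every code point in `text`. Hypothetical helper."""
    highest = max((ord(ch) for ch in text), default=0)
    if highest < 0x100:      # fits in a byte (the ByteString case)
        return 1
    if highest < 0x10000:    # fits in two bytes (the 16-bit string case)
        return 2
    return 4                 # anything larger (the WideString case)


def store(text: str) -> Tuple[int, bytes]:
    """Store each code point as a fixed-width integer.
    Little-endian is an arbitrary assumption; no encoding scheme is applied."""
    width = element_width(text)
    raw = b"".join(ord(ch).to_bytes(width, "little") for ch in text)
    return width, raw


def load(width: int, raw: bytes) -> str:
    """Rebuild the string from fixed-width code points."""
    return "".join(
        chr(int.from_bytes(raw[i:i + width], "little"))
        for i in range(0, len(raw), width)
    )
```

Note that with this sketch "é" (U+00E9) is stored as the single byte 0xE9, whereas its UTF-8 encoding takes two bytes (0xC3 0xA9). The compact storage is just the code point itself; an actual encoding step is only needed at the boundary with external apps/APIs.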
