> On 06 Dec 2015, at 18:44, Sven Van Caekenberghe <[email protected]> wrote:
>
>> On 05 Dec 2015, at 17:35, Todd Blanchard <[email protected]> wrote:
>>
>> I would suggest that the only worthwhile encoding is UTF8 - the rest are
>> distractions except for being able to read and convert from other encodings
>> to UTF8. UTF16 is a complete waste of time.
>>
>> Read http://utf8everywhere.org/
>>
>> I have extensive Unicode chops from around 1999 to 2004 and my experience
>> leads me to strongly agree with the views on that site.
>
> Well, I read the page/document/site as well. It was very interesting indeed,
> thanks for sharing it.
>
> In some sense it made me reconsider my aversion to in-image UTF-8 encoding;
> maybe it could have some value. Absolute storage is more efficient, some
> processing might also be more efficient, and I/O conversions to/from UTF-8
> become a no-op. What I found nice is the suggestion that most structured
> parsing (XML, JSON, CSV, STON, ...) could actually ignore the encoding for a
> large part and just assume it's ASCII, which would/could be nice for
> performance. Also, the fact that a lot of strings are (or should be) treated
> as opaque makes a lot of sense.
>
> What I did not like is that much of the argumentation is based on issues in
> the Windows world; take all that away and the document shrinks by half. I
> would have liked a few more fundamental CS arguments.
>
> Canonicalisation and sorting issues are hardly discussed.
>
> In one place, the fact that a lot of special characters can have multiple
> representations is a big argument, while it is not mentioned how just
> treating things like a byte sequence would solve this (it doesn't, AFAIU).
> For example, how do you search for $e or $é if you know that it is possible
> to represent $é both as the single character $é and as $e + $´?
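The multiple-representations problem above is easy to demonstrate. A minimal sketch in Python, using the standard-library `unicodedata` module as a stand-in for whatever Unicode tables a Smalltalk implementation would carry:

```python
import unicodedata

composed = "\u00e9"        # 'é' as the single precomposed codepoint U+00E9
decomposed = "e\u0301"     # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render identically, yet a naive codepoint-by-codepoint
# (or byte-by-byte) comparison treats them as different:
print(composed == decomposed)   # False

# Normalizing both sides first - here to NFC, the composed form -
# makes the comparison behave as a user would expect:
def nfc(s):
    return unicodedata.normalize("NFC", s)

print(nfc(composed) == nfc(decomposed))   # True

# The decomposed form (NFD) works equally well, as long as both
# sides of every comparison agree on the form:
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

So treating strings as opaque byte sequences does not solve searching or matching by itself; some normalization step has to happen somewhere.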
That’s what normalization is for: http://unicode.org/faq/normalization.html.
It will generate the same codepoint sequence for two strings where one
contains the combining character and the other is a “single character”.

> Sven
>

>> Sent from the road
>>
>> On Dec 5, 2015, at 05:08, stepharo <[email protected]> wrote:
>>
>>> Hi EuanM
>>>
>>> On 4/12/15 12:42, EuanM wrote:
>>>> I'm currently groping my way to seeing how feature-complete our
>>>> Unicode support is. I am doing this to establish what still needs to
>>>> be done to provide full Unicode support.
>>>
>>> This is great. Thanks for pushing this. I wrote and collected some
>>> roadmap notes (analyses of different topics) on the Pharo GitHub
>>> project; feel free to add this one there.
>>>>
>>>> This seems to me to be an area where it would be best to write it
>>>> once, and then have the same codebase incorporated into the Smalltalks
>>>> that most share a common ancestry.
>>>>
>>>> I am keen to get: equality-testing for strings; sortability for
>>>> strings which have ligatures and diacritic characters; and correct
>>>> round-tripping of data.
>>> Go!
>>> My suggestion is:
>>> - start small
>>> - make steady progress
>>> - write tests
>>> - commit often :)
>>>
>>> Stef
>>>
>>> What is the French phone-book ordering? This is the first time I have
>>> heard of it.
>>>>
>>>> Call to action:
>>>> ==========
>>>>
>>>> If you have comments on these proposals - such as "but we already have
>>>> that facility" or "the reason we do not have these facilities is
>>>> because they are dog-slow" - please let me know them.
>>>>
>>>> If you would like to help out, please let me know.
>>>>
>>>> If you have Unicode experience and expertise, and would like to be, or
>>>> would be willing to be, in the 'council of experts' for this project,
>>>> please let me know.
>>>>
>>>> If you have comments or ideas on anything mentioned in this email,
>>>> please let me know.
>>>>
>>>> In the first instance, the initiative's website will be:
>>>> http://smalltalk.uk.to/unicode.html
>>>>
>>>> I have created a SqueakSource.com project called UnicodeSupport.
>>>>
>>>> I want to avoid re-inventing any facilities which already exist,
>>>> except where they prevent us reaching the goals of:
>>>> - sortable UTF8 strings
>>>> - sortable UTF16 strings
>>>> - equivalence testing of 2 UTF8 strings
>>>> - equivalence testing of 2 UTF16 strings
>>>> - round-tripping UTF8 strings through Smalltalk
>>>> - round-tripping UTF16 strings through Smalltalk
>>>>
>>>> As I understand it, we have limited Unicode support at the moment.
>>>>
>>>> Current state of play
>>>> ===============
>>>> ByteString gets converted to WideString when the need is automagically
>>>> detected.
>>>>
>>>> Is there anything else that currently exists?
>>>>
>>>> Definition of Terms
>>>> ==============
>>>> A quick definition of terms before I go any further:
>>>>
>>>> Standard terms from the Unicode standard
>>>> ===============================
>>>> a compatibility character: an additional encoding of a *normal*
>>>> character, for compatibility and round-trip conversion purposes. For
>>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>>
>>>> Made-up terms
>>>> ============
>>>> a convenience codepoint: a single codepoint which represents an item
>>>> that is also encoded as a string of codepoints.
>>>>
>>>> (I tend to use the terms compatibility character and compatibility
>>>> codepoint interchangeably. The standard only refers to them as
>>>> compatibility characters. However, the standard is determined to
>>>> emphasise that characters are abstract and that codepoints are
>>>> concrete. So I think it is often more useful and productive to think
>>>> of compatibility or convenience codepoints.)
>>>>
>>>> a composed character: a character made up of several codepoints
>>>>
>>>> Unicode encoding explained
>>>> =====================
>>>> A convenience codepoint can therefore be thought of as a codepoint
>>>> used for a character which also has a composed form.
>>>>
>>>> The way Unicode works is that sometimes you can encode a character in
>>>> one byte, sometimes not. Sometimes you can encode it in two bytes,
>>>> sometimes not.
>>>>
>>>> You can therefore have a long stream of ASCII which is single-byte
>>>> Unicode. If there is an occasional Cyrillic or Greek character in the
>>>> stream, it would be represented either by a compatibility character or
>>>> by a multi-byte combination.
>>>>
>>>> Using compatibility characters can prevent proper sorting and
>>>> equivalence testing.
>>>>
>>>> Using "pure" Unicode, i.e. "normal encodings", can cause compatibility
>>>> and round-tripping problems. Although avoiding them can *also* cause
>>>> compatibility issues and round-tripping problems.
>>>>
>>>> Currently my thinking is:
>>>>
>>>> a Utf8String class
>>>>   an OrderedCollection, with 1-byte characters as the modal element,
>>>>   but short arrays of wider characters where necessary
>>>> a Utf16String class
>>>>   an OrderedCollection, with 2-byte characters as the modal element,
>>>>   but short arrays of wider characters,
>>>>   beginning with a 2-byte endianness indicator.
>>>>
>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>>> compatible.
>>>>
>>>> So my thinking is that Utf8String will contain convenience codepoints,
>>>> for round-tripping. And where there are multiple convenience
>>>> codepoints for a character, that it standardises on one.
>>>>
>>>> And that there is a Utf8SortableString which uses *only* normal
>>>> characters.
>>>>
>>>> We then need methods to convert between the two.
>>>>
>>>> aUtf8String asUtf8SortableString
>>>>
>>>> and
>>>>
>>>> aUtf8SortableString asUtf8String
>>>>
>>>> Sort orders are culture- and context-dependent - Sweden and Germany
>>>> have different sort orders for the same diacritic-ed characters. Some
>>>> countries have one order in general usage, and another for specific
>>>> usages, such as phone directories (e.g. UK and France).
>>>>
>>>> Similarly for UTF16: Utf16String and Utf16SortableString and
>>>> conversion methods.
>>>>
>>>> A list of sorted words would be a SortedCollection, and there could be
>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>>> seOrder, ukOrder, etc.,
>>>>
>>>> along the lines of
>>>>
>>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>>
>>>> If a word is either a Utf8SortableString or a well-formed Utf8String,
>>>> then we can perform equivalence testing on them trivially.
>>>>
>>>> To make sure a Utf8String is well formed, we would need to have a way
>>>> of cleaning up any convenience codepoints which were valid, but which
>>>> were for a character which has multiple equally-valid alternative
>>>> convenience codepoints, and for which the string currently had the
>>>> "wrong" convenience codepoint. (I.e. for any character with valid
>>>> alternative convenience codepoints, we would choose one to be in the
>>>> well-formed Utf8String, and we would need a method for cleaning the
>>>> alternative convenience codepoints out of the string, and replacing
>>>> them with the chosen approved convenience codepoint.)
>>>>
>>>> aUtf8String cleanUtf8String
>>>>
>>>> With WideString, a lot of the issues disappear - except
>>>> round-tripping. (Although I'm sure I have seen something recently
>>>> about 4-byte strings that also have an additional bit, which would
>>>> make some Unicode characters 5 bytes long.)
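The cleanUtf8String and trivial-equivalence ideas above amount to: pick one "approved" representation per character, rewrite strings into it, and then compare directly. A minimal sketch in Python, assuming NFC (precomposed codepoints) is chosen as the approved form; the names `clean_utf8` and `equivalent` are illustrative, not proposed API:

```python
import unicodedata

def clean_utf8(s: str) -> str:
    # Replace any alternative representation of a character with the
    # single chosen "approved" form - here NFC, the composed form.
    return unicodedata.normalize("NFC", s)

def equivalent(a: str, b: str) -> bool:
    # Once both strings are well formed, equivalence testing is a
    # plain element-by-element comparison.
    return clean_utf8(a) == clean_utf8(b)

print(equivalent("caf\u00e9", "cafe\u0301"))   # True: same text, two spellings
print(equivalent("caf\u00e9", "cafe"))         # False: genuinely different
```

If the cleaning step is applied once on construction (as proposed for well-formed Utf8Strings), the per-comparison cost disappears.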
>>>>
>>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>>> subtle, or somewhere in between - please let me know.)
>>>>
>>>> Cheers,
>>>> Euan
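As a rough illustration of the per-culture sortBlock idea above (frPhoneBookOrder, deOrder, ...), here is a toy German-style sort key in Python. The umlaut mapping follows the common DIN 5007-1 dictionary convention but is deliberately incomplete - a real deOrder would need full collation data:

```python
import unicodedata

# Illustrative, incomplete mapping: German dictionary ordering commonly
# sorts umlauts with their base vowels and treats 'ß' like 'ss'.
DE_MAP = str.maketrans({"\u00e4": "a", "\u00f6": "o", "\u00fc": "u", "\u00df": "ss"})

def de_key(word: str) -> str:
    # Normalize first so composed and decomposed spellings sort alike,
    # then fold case and apply the culture-specific mapping.
    return unicodedata.normalize("NFC", word).lower().translate(DE_MAP)

words = ["Zebra", "\u00d6l", "Apfel", "Ober"]
print(sorted(words, key=de_key))   # ['Apfel', 'Ober', 'Öl', 'Zebra']
```

A sortBlock in the proposal would play the same role as the `key` function here: the collection stays generic, and all culture-specific knowledge lives in the comparison.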
