> On 05 Dec 2015, at 17:35, Todd Blanchard <[email protected]> wrote:
>
> I would suggest that the only worthwhile encoding is UTF-8 - the rest
> are distractions, except for being able to read and convert from other
> encodings to UTF-8. UTF-16 is a complete waste of time.
>
> Read http://utf8everywhere.org/
>
> I have extensive Unicode chops from around 1999 to 2004, and my
> experience leads me to strongly agree with the views on that site.
Well, I read the page/document/site as well. It was very interesting
indeed, thanks for sharing it.

In some sense it made me reconsider my aversion to in-image UTF-8
encoding; maybe it could have some value. Storage is more efficient in
absolute terms, some processing might also be more efficient, and I/O
conversions to/from UTF-8 become a no-op.

What I found nice is the suggestion that most structured parsing (XML,
JSON, CSV, STON, ...) could actually ignore the encoding for a large
part and just assume it is ASCII, which could be nice for performance.
Also, the fact that a lot of strings are (or should be) treated as
opaque makes a lot of sense.

What I did not like is that much of the argumentation is based on
issues in the Windows world; take all that away and the document
shrinks by half. I would have liked a few more fundamental CS
arguments. Canonicalisation and sorting issues are hardly discussed.
In one place, the fact that a lot of special characters can have
multiple representations is a big argument, yet it is not mentioned
how just treating things as byte sequences would solve this (it
doesn't, AFAIU). For example, how do you search for $e or $é when you
know that $é can be represented both as the single codepoint $é and as
$e + $´ ?
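To make that concrete, here is a quick sketch in Pharo (it assumes the
Zinc #utf8Encoded extension that ships with a stock image):

  | composed decomposed |
  composed := String with: (Character value: 16rE9).
      "precomposed é, the single codepoint U+00E9"
  decomposed := String with: $e with: (Character value: 16r0301).
      "$e followed by the combining acute accent, U+0301"
  composed = decomposed.     "false - comparison is codepoint by codepoint"
  composed utf8Encoded.      "#[195 169]"
  decomposed utf8Encoded.    "#[101 204 129]"

Both render as é, but neither codepoint comparison nor byte comparison
treats them as equal, so without normalisation both searching and
equivalence testing fail across the two representations.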
Sven

> Sent from the road
>
> On Dec 5, 2015, at 05:08, stepharo <[email protected]> wrote:
>
>> Hi EuanM
>>
>> On 4/12/15 12:42, EuanM wrote:
>>> I'm currently groping my way to seeing how feature-complete our
>>> Unicode support is. I am doing this to establish what still needs
>>> to be done to provide full Unicode support.
>>
>> This is great. Thanks for pushing this. I have written and collected
>> some roadmap material (analyses of different topics) on the Pharo
>> GitHub project; feel free to add this one there.
>>>
>>> This seems to me to be an area where it would be best to write it
>>> once, and then have the same codebase incorporated into the
>>> Smalltalks that most share a common ancestry.
>>>
>>> I am keen to get: equality testing for strings; sortability for
>>> strings which have ligatures and diacritic characters; and correct
>>> round-tripping of data.
>> Go!
>> My suggestion is:
>> start small,
>> make steady progress,
>> write tests,
>> commit often :)
>>
>> Stef
>>
>> What is the French phone book ordering? This is the first time I
>> have heard about it.
>>>
>>> Call to action:
>>> ===========
>>>
>>> If you have comments on these proposals - such as "but we already
>>> have that facility" or "the reason we do not have these facilities
>>> is because they are dog-slow" - please let me know.
>>>
>>> If you would like to help out, please let me know.
>>>
>>> If you have Unicode experience and expertise, and would like to be,
>>> or would be willing to be, on the 'council of experts' for this
>>> project, please let me know.
>>>
>>> If you have comments or ideas on anything else mentioned in this
>>> email, please let me know.
>>>
>>> In the first instance, the initiative's website will be:
>>> http://smalltalk.uk.to/unicode.html
>>>
>>> I have created a SqueakSource.com project called UnicodeSupport.
>>>
>>> I want to avoid re-inventing any facilities which already exist,
>>> except where they prevent us reaching the goals of:
>>> - sortable UTF-8 strings
>>> - sortable UTF-16 strings
>>> - equivalence testing of 2 UTF-8 strings
>>> - equivalence testing of 2 UTF-16 strings
>>> - round-tripping UTF-8 strings through Smalltalk
>>> - round-tripping UTF-16 strings through Smalltalk
>>>
>>> As I understand it, we have limited Unicode support at the moment.
>>>
>>> Current state of play
>>> ================
>>>
>>> ByteString gets converted to WideString automagically when the need
>>> is detected.
>>>
>>> Is there anything else that currently exists?
>>>
>>> Definition of terms
>>> ===============
>>>
>>> A quick definition of terms before I go any further:
>>>
>>> Standard terms from the Unicode standard
>>> ================================
>>>
>>> a compatibility character : an additional encoding of a *normal*
>>> character, for compatibility and round-trip conversion purposes.
>>> For instance, a 1-byte encoding of a Latin character with a
>>> diacritic.
>>>
>>> Made-up terms
>>> ===========
>>>
>>> a convenience codepoint : a single codepoint which represents an
>>> item that is also encoded as a string of codepoints.
>>>
>>> (I tend to use the terms compatibility character and compatibility
>>> codepoint interchangeably. The standard only refers to them as
>>> compatibility characters. However, the standard is determined to
>>> emphasise that characters are abstract and that codepoints are
>>> concrete, so I think it is often more useful and productive to
>>> think of compatibility or convenience codepoints.)
>>>
>>> a composed character : a character made up of several codepoints
>>>
>>> Unicode encoding explained
>>> =====================
>>>
>>> A convenience codepoint can therefore be thought of as a codepoint
>>> used for a character which also has a composed form.
>>>
>>> The way Unicode works is that sometimes you can encode a character
>>> in one byte, sometimes not. Sometimes you can encode it in two
>>> bytes, sometimes not.
>>>
>>> You can therefore have a long stream of ASCII which is single-byte
>>> Unicode. If there is an occasional Cyrillic or Greek character in
>>> the stream, it would be represented either by a compatibility
>>> character or by a multi-byte combination.
>>>
>>> Using compatibility characters can prevent proper sorting and
>>> equivalence testing.
>>>
>>> Using "pure" Unicode, i.e. "normal encodings", can cause
>>> compatibility and round-tripping problems, although avoiding them
>>> can *also* cause compatibility issues and round-tripping problems.
>>>
>>> Currently my thinking is:
>>>
>>> a Utf8String class:
>>> an ordered collection with 1-byte characters as the modal element,
>>> but short arrays of wider characters where necessary
>>>
>>> a Utf16String class:
>>> an ordered collection with 2-byte characters as the modal element,
>>> but short arrays of wider characters,
>>> beginning with a 2-byte endianness indicator.
>>>
>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>> compatible.
>>>
>>> So my thinking is that Utf8String will contain convenience
>>> codepoints, for round-tripping, and that where there are multiple
>>> convenience codepoints for a character, it standardises on one.
>>>
>>> And that there is a Utf8SortableString which uses *only* normal
>>> characters.
>>>
>>> We then need methods to convert between the two:
>>>
>>> aUtf8String asUtf8SortableString
>>>
>>> and
>>>
>>> aUtf8SortableString asUtf8String
>>>
>>> Sort orders are culture- and context-dependent - Sweden and Germany
>>> have different sort orders for the same diacritic-ed characters.
>>> Some countries have one order in general usage, and another for
>>> specific usages, such as phone directories (e.g. UK and France).
>>>
>>> Similarly for UTF-16: Utf16String and Utf16SortableString and
>>> conversion methods.
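As a concrete illustration of the variable-width encoding and the
automagic ByteString/WideString switch described above, here is what a
current Pharo image does (again assuming Zinc's #utf8Encoded and
#utf8Decoded extensions):

  'abc' class.                    "ByteString - one byte per character"
  'aé' class.                     "ByteString - é is U+00E9, still fits in one byte"
  'a€' class.                     "WideString - € is U+20AC, does not fit in one byte"
  'a€' utf8Encoded.               "#[97 226 130 172] - 1 byte for $a, 3 bytes for $€"
  #[97 226 130 172] utf8Decoded.  "'a€' - a lossless round trip"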
>>> A list of sorted words would be a SortedCollection, and there could
>>> be pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>> seOrder, ukOrder, etc., along the lines of
>>>
>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>
>>> If a word is either a Utf8SortableString or a well-formed
>>> Utf8String, then we can perform equivalence testing on them
>>> trivially.
>>>
>>> To make sure a Utf8String is well formed, we would need a way of
>>> cleaning up any convenience codepoints which were valid, but which
>>> were for a character which has multiple equally-valid alternative
>>> convenience codepoints, and for which the string currently had the
>>> "wrong" convenience codepoint. (I.e. for any character with valid
>>> alternative convenience codepoints, we would choose one to be in
>>> the well-formed Utf8String, and we would need a method for cleaning
>>> the alternative convenience codepoints out of the string and
>>> replacing them with the chosen, approved convenience codepoint.)
>>>
>>> aUtf8String cleanUtf8String
>>>
>>> With WideString, a lot of the issues disappear - except
>>> round-tripping (although I'm sure I have seen something recently
>>> about 4-byte strings that also have an additional bit, which would
>>> make some Unicode characters 5 bytes long).
>>>
>>> (I'm starting to zone out now - if I've overlooked anything -
>>> obvious, subtle, or somewhere in between - please let me know.)
>>>
>>> Cheers,
>>> Euan
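As a footnote to the sortBlock idea above, a minimal, hypothetical
deOrder could look like this in Pharo (deFold and its folding table are
made up for illustration; a real implementation needs proper collation
keys - for instance ß folds to 'ss' - plus per-locale tailoring):

  | map deFold deOrder words |
  map := Dictionary new.
  map at: $ä put: $a; at: $ö put: $o; at: $ü put: $u.
  "fold umlauts to their base letters, roughly DIN 5007-1 dictionary order"
  deFold := [ :string |
      string asLowercase collect: [ :each | map at: each ifAbsent: [ each ] ] ].
  deOrder := [ :a :b | (deFold value: a) <= (deFold value: b) ].
  words := SortedCollection sortBlock: deOrder.
  words addAll: #('Zebra' 'Öl' 'Ofen' 'Äpfel' 'Apfel').
  words asArray.
  "Apfel and Äpfel sort together, then Ofen, Öl, Zebra"

The point of the sketch is only that a sortBlock reduces to comparing
locale-folded keys; computing those keys properly is exactly where the
canonicalisation issues discussed earlier come back in.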
