Re: [Pharo-dev] Unicode Support

EuanM Sun, 06 Dec 2015 19:38:12 -0800

Todd, as long as others are using it, it's useful to be able to send
UTF16, and to successfully import it.


I like systems that play well with others. :-)

On 5 December 2015 at 16:35, Todd Blanchard <[email protected]> wrote:
> I would suggest that the only worthwhile encoding is UTF8 - the rest are
> distractions except for being able to read and convert from other encodings
> to UTF8. UTF16 is a complete waste of time.
>
> Read http://utf8everywhere.org/
>
> I have extensive Unicode chops from around 1999 to 2004 and my experience
> leads me to strongly agree with the views on that site.
>
>
> Sent from the road
>
> On Dec 5, 2015, at 05:08, stepharo <[email protected]> wrote:
>
> Hi EuanM
>
> On 4/12/15 12:42, EuanM wrote:
>
> I'm currently groping my way to seeing how feature-complete our
>
> Unicode support is.  I am doing this to establish what still needs to
>
> be done to provide full Unicode support.
>
>
> This is great. Thanks for pushing this. I wrote and collected some roadmaps
> (analyses on different topics) on the Pharo GitHub project; feel free to add
> this one there.
>
>
> This seems to me to be an area where it would be best to write it
>
> once, and then have the same codebase incorporated into the Smalltalks
>
> that most share a common ancestry.
>
>
> I am keen to get: equality-testing for strings; sortability for
>
> strings which have ligatures and diacritic characters; and correct
>
> round-tripping of data.
>
> Go!
> My suggestion is
>    start small
>    make steady progress
>    write tests
>    commit often :)
>
> Stef
>
> What is the French phone book ordering? This is the first time I have heard
> of it.
>
>
> Call to action:
>
> ==========
>
>
> If you have comments on these proposals - such as "but we already have
>
> that facility" or "the reason we do not have these facilities is
>
> because they are dog-slow" - please let me know them.
>
>
> If you would like to help out, please let me know.
>
>
> If you have Unicode experience and expertise, and would like to be, or
>
> would be willing to be, in the  'council of experts' for this project,
>
> please let me know.
>
>
> If you have comments or ideas on anything mentioned in this email, please let me know.
>
>
> In the first instance, the initiative's website will be:
>
> http://smalltalk.uk.to/unicode.html
>
>
> I have created a SqueakSource.com project called UnicodeSupport
>
>
> I want to avoid re-inventing any facilities which already exist.
>
> Except where they prevent us reaching the goals of:
>
>   - sortable UTF8 strings
>
>   - sortable UTF16 strings
>
>   - equivalence testing of 2 UTF8 strings
>
>   - equivalence testing of 2 UTF16 strings
>
>   - round-tripping UTF8 strings through Smalltalk
>
>   - round-tripping UTF16 strings through Smalltalk.
>
> As I understand it, we have limited Unicode support atm.
>
>
> Current state of play
>
> ===============
>
> ByteString gets converted to WideString when need is automagically detected.
>
>
> Is there anything else that currently exists?
>
>
> Definition of Terms
>
> ==============
>
> A quick definition of terms before I go any further:
>
>
> Standard terms from the Unicode standard
>
> ===============================
>
> a compatibility character : an additional encoding of a *normal*
>
> character, for compatibility and round-trip conversion purposes.  For
>
> instance, a 1-byte encoding of a Latin character with a diacritic.
>
>
> Made-up terms
>
> ============
>
> a convenience codepoint :  a single codepoint which represents an item
>
> that is also encoded as a string of codepoints.
>
>
> (I tend to use the terms compatibility character and compatibility
>
> codepoint interchangeably.  The standard only refers to them as
>
> compatibility characters.  However, the standard is determined to
>
> emphasise that characters are abstract and that codepoints are
>
> concrete.  So I think it is often more useful and productive to think
>
> of compatibility or convenience codepoints).
>
>
> a composed character :  a character made up of several codepoints
>
>
> Unicode encoding explained
>
> =====================
>
> A convenience codepoint can therefore be thought of as a code point
>
> used for a character which also has a composed form.
>
>
> The way Unicode (as UTF-8) works is that sometimes you can encode a character in
>
> one byte, sometimes not.  Sometimes you can encode it in two bytes,
>
> sometimes not.
>
>
> You can therefore have a long stream of ASCII which is single-byte
>
> Unicode.  If there is an occasional Cyrillic or Greek character in the
>
> stream, it would be represented either by a compatibility character or
>
> by a multi-byte combination.
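To make the byte counts concrete, here is a minimal sketch in Python (not Smalltalk, and not part of the proposed UnicodeSupport code). It assumes the stream is UTF-8; `utf8_length` is a hypothetical helper name. In UTF-8 specifically, Greek and Cyrillic letters are ordinary two-byte sequences, and the sample characters were picked only for illustration.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for one code point."""
    return len(ch.encode("utf-8"))

samples = [
    "A",            # U+0041 LATIN CAPITAL LETTER A
    "\u03b1",       # U+03B1 GREEK SMALL LETTER ALPHA
    "\u0436",       # U+0436 CYRILLIC SMALL LETTER ZHE
    "\u20ac",       # U+20AC EURO SIGN
    "\U0001F600",   # U+1F600, outside the Basic Multilingual Plane
]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

# A mostly-ASCII stream with one Greek letter: 3 + 2 bytes.
mixed = "abc\u03b1"
print(len(mixed), "code points,", len(mixed.encode("utf-8")), "bytes")
```

The ASCII letter takes one byte, the Greek and Cyrillic letters two each, the euro sign three, and the last sample four.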
>
>
> Using compatibility characters can prevent proper sorting and
>
> equivalence testing.
>
>
> Using "pure" Unicode, ie. "normal encodings", can cause compatibility
>
> and round-tripping problems.  Although avoiding them can *also* cause
>
> compatibility issues and round-tripping problems.
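The tension between equivalence testing and round-tripping can be sketched in Python with the standard `unicodedata` module. This assumes that a "convenience codepoint" corresponds to what the standard calls a precomposed character, and that normalising to NFC is one acceptable way to compare; `canonically_equal` is a hypothetical helper name.

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)                   # False: different code points
print(canonically_equal(precomposed, decomposed))  # True: same abstract character

# Round-tripping: normalising changes the bytes that came in.
original_bytes = decomposed.encode("utf-8")
normalised_bytes = unicodedata.normalize("NFC", decomposed).encode("utf-8")
print(original_bytes, normalised_bytes)
```

Comparison wants a single normal form, while byte-exact round-tripping wants the incoming code points left untouched, so a design probably needs to keep the original and normalise only for comparison.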
>
>
> Currently my thinking is:
>
>
> a Utf8String class
>
> an Ordered collection, with 1 byte characters as the modal element,
>
> but short arrays of wider strings where necessary
>
> a Utf16String class
>
> an Ordered collection, with 2 byte characters as the modal element,
>
> but short arrays of wider strings
>
> beginning with a 2-byte endianness indicator.
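A rough Python sketch of the "2-byte endianness indicator" idea, assuming it means the UTF-16 byte order mark (U+FEFF). The function names are hypothetical; `codecs.BOM_UTF16_LE` and `codecs.BOM_UTF16_BE` are standard-library constants.

```python
import codecs

def encode_utf16_with_bom(text: str, big_endian: bool = False) -> bytes:
    """Encode text as UTF-16, prefixed with the matching byte order mark."""
    if big_endian:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 data; the 'utf-16' codec reads and strips a leading BOM."""
    return data.decode("utf-16")

little = encode_utf16_with_bom("A")
big = encode_utf16_with_bom("A", big_endian=True)
print(little, big)
print(decode_utf16_with_bom(little) == decode_utf16_with_bom(big) == "A")

# Characters outside the Basic Multilingual Plane need a surrogate pair,
# i.e. two 16-bit units, so "2 bytes" is only the common case.
print(len("\U0001F600".encode("utf-16-le")))
```

Both byte orders decode to the same string, and the final line shows why the design above still needs "short arrays of wider strings": a non-BMP character occupies four bytes.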
>
>
> Utf8Strings sometimes need to be sortable, and sometimes need to be
> compatible.
>
>
> So my thinking is that Utf8String will contain convenience codepoints,
>
> for round-tripping.  And where there are multiple convenience
>
> codepoints for a character, that it standardises on one.
>
>
> And that there is a Utf8SortableString which uses *only* normal characters.
>
>
> We then need methods to convert between the two.
>
>
> aUtf8String asUtf8SortableString
>
>
> and
>
>
> aUtf8SortableString asUtf8String
>
>
>
> Sort orders are culture and context dependent - Sweden and Germany
>
> have different sort orders for the same diacritic-ed characters.  Some
>
> countries have one order in general usage, and another for specific
>
> usages, such as phone directories (e.g. UK and France)
>
>
> Similarly for Utf16 :  Utf16String and Utf16SortableString and
>
> conversion methods
>
>
> A list of sorted words would be a SortedCollection, and there could be
>
> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>
> seOrder, ukOrder, etc
>
>
> along the lines of
>
> aListOfWords := SortedCollection sortBlock: deOrder
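As an illustration of per-culture sort blocks, here is a deliberately simplified Python sketch. `de_phonebook_key` and `se_key` are hypothetical names standing in for the deOrder and seOrder sort blocks. The German key follows the phone book convention of expanding umlauts (ä to ae, and so on), and the Swedish key places å, ä, ö after z. Real collation is considerably more involved (the Unicode Collation Algorithm with per-locale tailoring), so treat this only as a demonstration that the same words sort differently.

```python
import unicodedata

# German phone book convention: umlauts expand to vowel + e, sharp s to ss.
_DE_EXPANSIONS = str.maketrans({
    "\u00e4": "ae",  # a-umlaut
    "\u00f6": "oe",  # o-umlaut
    "\u00fc": "ue",  # u-umlaut
    "\u00df": "ss",  # sharp s
})

# Swedish alphabet: a-ring, a-umlaut, o-umlaut are separate letters after z.
_SE_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
_SE_RANK = {ch: i for i, ch in enumerate(_SE_ALPHABET)}

def _prepare(word: str) -> str:
    """Normalise to NFC and lower-case so precomposed letters can be looked up."""
    return unicodedata.normalize("NFC", word).lower()

def de_phonebook_key(word: str) -> str:
    return _prepare(word).translate(_DE_EXPANSIONS)

def se_key(word: str) -> list:
    # Letters not in the table sort after everything else.
    return [_SE_RANK.get(ch, len(_SE_ALPHABET)) for ch in _prepare(word)]

words = ["Zebra", "\u00c4pfel", "Apfel", "Ofen", "\u00d6l"]
print(sorted(words, key=de_phonebook_key))
print(sorted(words, key=se_key))
```

With the German key the umlauted words sort next to their base letters; with the Swedish key they move to the end, after "Zebra".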
>
>
> If a word is either a Utf8SortableString, or a well-formed Utf8String,
>
> then we can perform equivalence testing on them trivially.
>
>
> To make sure a Utf8String is well formed, we would need to have a way
>
> of cleaning up any convenience codepoints which were valid, but which
>
> were for a character which has multiple equally-valid alternative
>
> convenience codepoints, and for which the string currently had the
>
> "wrong" convenience codepoint.  (i.e. for any character with valid
>
> alternative convenience codepoints, we would choose one to be in the
>
> well-formed Utf8String, and we would need a method for cleaning the
>
> alternative convenience codepoints out of the string, and replacing
>
> them with the chosen approved convenience codepoint.)
>
>
> aUtf8String cleanUtf8String
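A sketch of what such a clean-up could look like, in Python and under the assumption that "choosing one approved codepoint" can be delegated to Unicode normalisation form NFC. `clean` is a hypothetical analogue of cleanUtf8String, not an existing method.

```python
import unicodedata

def clean(text: str) -> str:
    """Replace alternative spellings with one canonical choice (NFC)."""
    return unicodedata.normalize("NFC", text)

angstrom_sign = "\u212b"   # ANGSTROM SIGN
a_with_ring = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
a_plus_ring = "A\u030a"    # 'A' followed by COMBINING RING ABOVE

# All three spellings collapse to the same single code point.
print([f"U+{ord(c):04X}" for c in clean(angstrom_sign + a_with_ring + a_plus_ring)])

# Cleaning is idempotent: a cleaned string stays unchanged.
print(clean(clean(a_plus_ring)) == clean(a_plus_ring))

# Compatibility characters proper (e.g. the 'fi' ligature U+FB01) survive NFC;
# folding them requires the lossier NFKC form.
ligature = "\ufb01"
print(clean(ligature) == ligature, unicodedata.normalize("NFKC", ligature))
```

The last lines show a distinction worth keeping in mind: NFC only merges canonically equivalent spellings, whereas NFKC also folds compatibility characters and so cannot be undone for round-tripping.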
>
>
> With WideString, a lot of the issues disappear - except
>
> round-tripping (although I'm sure I have seen something recently about
>
> 4-byte strings that also have an additional bit.  Which would make
>
> some Unicode characters 5-bytes long.)
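On the 5-byte question: as far as I know, Unicode code points stop at U+10FFFF, and since RFC 3629 UTF-8 is limited to four bytes per code point (earlier definitions of UTF-8 did allow longer sequences, which may be where the recollection comes from). A small Python check, with `encoded_sizes` as a hypothetical helper name:

```python
import sys

def encoded_sizes(code_point: int) -> dict:
    """Return the encoded length in bytes of one code point, per encoding."""
    ch = chr(code_point)
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),  # no BOM
        "utf-32": len(ch.encode("utf-32-le")),  # no BOM
    }

highest = 0x10FFFF  # last valid Unicode code point
print(hex(sys.maxunicode))
print(encoded_sizes(highest))

try:
    chr(highest + 1)
except ValueError as err:
    print("beyond the Unicode range:", err)
```

The highest code point needs four bytes in all three encodings, so a fixed 32-bit element per character, which is my assumption about what WideString uses, has room to spare.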
>
>
>
> (I'm starting to zone out now - if I've overlooked anything - obvious,
>
> subtle, or somewhere in between, please let me know)
>
>
> Cheers,
>
>     Euan
>
>
>
>
>
