> On 06 Dec 2015, at 18:44, Sven Van Caekenberghe <[email protected]> wrote:
>
>> On 05 Dec 2015, at 17:35, Todd Blanchard <[email protected]> wrote:
>>
>> I would suggest that the only worthwhile encoding is UTF8 - the rest are
>> distractions except for being able to read and convert from other encodings
>> to UTF8. UTF16 is a complete waste of time.
>>
>> Read http://utf8everywhere.org/
>>
>> I have extensive Unicode chops from around 1999 to 2004 and my experience
>> leads me to strongly agree with the views on that site.
>
> Well, I read the page/document/site as well. It was very interesting indeed,
> thanks for sharing it.
>
> In some sense it made me reconsider my aversion to in-image UTF-8 encoding;
> maybe it could have some value. Absolute storage is more efficient, some
> processing might also be more efficient, and I/O conversions to/from UTF-8
> become a no-op. What I found nice is the suggestion that most structured
> parsing (XML, JSON, CSV, STON, ...) could actually ignore the encoding for a
> large part and just assume it's ASCII, which would/could be nice for
> performance. Also, the fact that a lot of strings are (or should be) treated
> as opaque makes a lot of sense.
>
> What I did not like is that much of the argumentation is based on issues in
> the Windows world; take all that away and the document shrinks by half. I
> would have liked a few more fundamental CS arguments.
>
> Canonicalisation and sorting issues are hardly discussed.
>
> In one place, the fact that a lot of special characters can have multiple
> representations is a big argument, while it is not mentioned how just
> treating things like a byte sequence would solve this (it doesn't, AFAIU).
> For example, how do you search for $e or $é if you know that it is possible
> to represent $é both as the single character $é and as $e + $´?
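The multiple-representations problem above is easy to demonstrate. A minimal sketch in Python, using the standard-library `unicodedata` module as a stand-in for whatever Unicode tables a Smalltalk implementation would carry:

```python
import unicodedata

composed = "\u00e9"        # 'é' as the single precomposed codepoint U+00E9
decomposed = "e\u0301"     # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render identically, yet a naive codepoint-by-codepoint
# (or byte-by-byte) comparison treats them as different:
print(composed == decomposed)   # False

# Normalizing both sides first - here to NFC, the composed form -
# makes the comparison behave as a user would expect:
def nfc(s):
    return unicodedata.normalize("NFC", s)

print(nfc(composed) == nfc(decomposed))   # True

# The decomposed form (NFD) works equally well, as long as both
# sides of every comparison agree on the form:
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

So treating strings as opaque byte sequences does not solve searching or matching by itself; some normalization step has to happen somewhere.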
That’s what normalization is for: http://unicode.org/faq/normalization.html.
It will generate the same codepoint sequence for two strings where one
contains the combining character and the other is a “single character”.

> Sven
>

>> Sent from the road
>>
>> On Dec 5, 2015, at 05:08, stepharo <[email protected]> wrote:
>>
>>> Hi EuanM
>>>
>>> On 4/12/15 12:42, EuanM wrote:
>>>> I'm currently groping my way to seeing how feature-complete our
>>>> Unicode support is. I am doing this to establish what still needs to
>>>> be done to provide full Unicode support.
>>>
>>> This is great. Thanks for pushing this. I wrote and collected some
>>> roadmap notes (analyses of different topics) on the Pharo GitHub
>>> project; feel free to add this one there.
>>>>
>>>> This seems to me to be an area where it would be best to write it
>>>> once, and then have the same codebase incorporated into the Smalltalks
>>>> that most share a common ancestry.
>>>>
>>>> I am keen to get: equality-testing for strings; sortability for
>>>> strings which have ligatures and diacritic characters; and correct
>>>> round-tripping of data.
>>> Go!
>>> My suggestion is:
>>> - start small
>>> - make steady progress
>>> - write tests
>>> - commit often :)
>>>
>>> Stef
>>>
>>> What is the French phone-book ordering? This is the first time I have
>>> heard of it.
>>>>
>>>> Call to action:
>>>> ==========
>>>>
>>>> If you have comments on these proposals - such as "but we already have
>>>> that facility" or "the reason we do not have these facilities is
>>>> because they are dog-slow" - please let me know them.
>>>>
>>>> If you would like to help out, please let me know.
>>>>
>>>> If you have Unicode experience and expertise, and would like to be, or
>>>> would be willing to be, in the 'council of experts' for this project,
>>>> please let me know.
>>>>
>>>> If you have comments or ideas on anything mentioned in this email,
>>>> please let me know.
>>>>
>>>> In the first instance, the initiative's website will be:
>>>> http://smalltalk.uk.to/unicode.html
>>>>
>>>> I have created a SqueakSource.com project called UnicodeSupport.
>>>>
>>>> I want to avoid re-inventing any facilities which already exist,
>>>> except where they prevent us reaching the goals of:
>>>> - sortable UTF8 strings
>>>> - sortable UTF16 strings
>>>> - equivalence testing of 2 UTF8 strings
>>>> - equivalence testing of 2 UTF16 strings
>>>> - round-tripping UTF8 strings through Smalltalk
>>>> - round-tripping UTF16 strings through Smalltalk
>>>>
>>>> As I understand it, we have limited Unicode support at the moment.
>>>>
>>>> Current state of play
>>>> ===============
>>>> ByteString gets converted to WideString when the need is automagically
>>>> detected.
>>>>
>>>> Is there anything else that currently exists?
>>>>
>>>> Definition of Terms
>>>> ==============
>>>> A quick definition of terms before I go any further:
>>>>
>>>> Standard terms from the Unicode standard
>>>> ===============================
>>>> a compatibility character: an additional encoding of a *normal*
>>>> character, for compatibility and round-trip conversion purposes. For
>>>> instance, a 1-byte encoding of a Latin character with a diacritic.
>>>>
>>>> Made-up terms
>>>> ============
>>>> a convenience codepoint: a single codepoint which represents an item
>>>> that is also encoded as a string of codepoints.
>>>>
>>>> (I tend to use the terms compatibility character and compatibility
>>>> codepoint interchangeably. The standard only refers to them as
>>>> compatibility characters. However, the standard is determined to
>>>> emphasise that characters are abstract and that codepoints are
>>>> concrete. So I think it is often more useful and productive to think
>>>> of compatibility or convenience codepoints.)
>>>>
>>>> a composed character: a character made up of several codepoints
>>>>
>>>> Unicode encoding explained
>>>> =====================
>>>> A convenience codepoint can therefore be thought of as a codepoint
>>>> used for a character which also has a composed form.
>>>>
>>>> The way Unicode works is that sometimes you can encode a character in
>>>> one byte, sometimes not. Sometimes you can encode it in two bytes,
>>>> sometimes not.
>>>>
>>>> You can therefore have a long stream of ASCII which is single-byte
>>>> Unicode. If there is an occasional Cyrillic or Greek character in the
>>>> stream, it would be represented either by a compatibility character or
>>>> by a multi-byte combination.
>>>>
>>>> Using compatibility characters can prevent proper sorting and
>>>> equivalence testing.
>>>>
>>>> Using "pure" Unicode, i.e. "normal encodings", can cause compatibility
>>>> and round-tripping problems. Although avoiding them can *also* cause
>>>> compatibility issues and round-tripping problems.
>>>>
>>>> Currently my thinking is:
>>>>
>>>> a Utf8String class
>>>>   an OrderedCollection, with 1-byte characters as the modal element,
>>>>   but short arrays of wider characters where necessary
>>>> a Utf16String class
>>>>   an OrderedCollection, with 2-byte characters as the modal element,
>>>>   but short arrays of wider characters,
>>>>   beginning with a 2-byte endianness indicator.
>>>>
>>>> Utf8Strings sometimes need to be sortable, and sometimes need to be
>>>> compatible.
>>>>
>>>> So my thinking is that Utf8String will contain convenience codepoints,
>>>> for round-tripping. And where there are multiple convenience
>>>> codepoints for a character, that it standardises on one.
>>>>
>>>> And that there is a Utf8SortableString which uses *only* normal
>>>> characters.
>>>>
>>>> We then need methods to convert between the two.
>>>>
>>>> aUtf8String asUtf8SortableString
>>>>
>>>> and
>>>>
>>>> aUtf8SortableString asUtf8String
>>>>
>>>> Sort orders are culture- and context-dependent - Sweden and Germany
>>>> have different sort orders for the same diacritic-ed characters. Some
>>>> countries have one order in general usage, and another for specific
>>>> usages, such as phone directories (e.g. UK and France).
>>>>
>>>> Similarly for UTF16: Utf16String and Utf16SortableString and
>>>> conversion methods.
>>>>
>>>> A list of sorted words would be a SortedCollection, and there could be
>>>> pre-prepared sortBlocks for them, e.g. frPhoneBookOrder, deOrder,
>>>> seOrder, ukOrder, etc.,
>>>>
>>>> along the lines of
>>>>
>>>> aListOfWords := SortedCollection sortBlock: deOrder
>>>>
>>>> If a word is either a Utf8SortableString or a well-formed Utf8String,
>>>> then we can perform equivalence testing on them trivially.
>>>>
>>>> To make sure a Utf8String is well formed, we would need to have a way
>>>> of cleaning up any convenience codepoints which were valid, but which
>>>> were for a character which has multiple equally-valid alternative
>>>> convenience codepoints, and for which the string currently had the
>>>> "wrong" convenience codepoint. (I.e. for any character with valid
>>>> alternative convenience codepoints, we would choose one to be in the
>>>> well-formed Utf8String, and we would need a method for cleaning the
>>>> alternative convenience codepoints out of the string, and replacing
>>>> them with the chosen approved convenience codepoint.)
>>>>
>>>> aUtf8String cleanUtf8String
>>>>
>>>> With WideString, a lot of the issues disappear - except
>>>> round-tripping. (Although I'm sure I have seen something recently
>>>> about 4-byte strings that also have an additional bit, which would
>>>> make some Unicode characters 5 bytes long.)
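The cleanUtf8String and trivial-equivalence ideas above amount to: pick one "approved" representation per character, rewrite strings into it, and then compare directly. A minimal sketch in Python, assuming NFC (precomposed codepoints) is chosen as the approved form; the names `clean_utf8` and `equivalent` are illustrative, not proposed API:

```python
import unicodedata

def clean_utf8(s: str) -> str:
    # Replace any alternative representation of a character with the
    # single chosen "approved" form - here NFC, the composed form.
    return unicodedata.normalize("NFC", s)

def equivalent(a: str, b: str) -> bool:
    # Once both strings are well formed, equivalence testing is a
    # plain element-by-element comparison.
    return clean_utf8(a) == clean_utf8(b)

print(equivalent("caf\u00e9", "cafe\u0301"))   # True: same text, two spellings
print(equivalent("caf\u00e9", "cafe"))         # False: genuinely different
```

If the cleaning step is applied once on construction (as proposed for well-formed Utf8Strings), the per-comparison cost disappears.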
>>>>
>>>> (I'm starting to zone out now - if I've overlooked anything - obvious,
>>>> subtle, or somewhere in between - please let me know.)
>>>>
>>>> Cheers,
>>>> Euan
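As a rough illustration of the per-culture sortBlock idea above (frPhoneBookOrder, deOrder, ...), here is a toy German-style sort key in Python. The umlaut mapping follows the common DIN 5007-1 dictionary convention but is deliberately incomplete - a real deOrder would need full collation data:

```python
import unicodedata

# Illustrative, incomplete mapping: German dictionary ordering commonly
# sorts umlauts with their base vowels and treats 'ß' like 'ss'.
DE_MAP = str.maketrans({"\u00e4": "a", "\u00f6": "o", "\u00fc": "u", "\u00df": "ss"})

def de_key(word: str) -> str:
    # Normalize first so composed and decomposed spellings sort alike,
    # then fold case and apply the culture-specific mapping.
    return unicodedata.normalize("NFC", word).lower().translate(DE_MAP)

words = ["Zebra", "\u00d6l", "Apfel", "Ober"]
print(sorted(words, key=de_key))   # ['Apfel', 'Ober', 'Öl', 'Zebra']
```

A sortBlock in the proposal would play the same role as the `key` function here: the collection stays generic, and all culture-specific knowledge lives in the comparison.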
