Mark
On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode <unicode@unicode.org> wrote:

> On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
> <unicode@unicode.org> wrote:
> >
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> > https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> * The Grapheme Cluster Model seems to have a couple of disadvantages
> that are not mentioned:
> 1) The subunit of string is also a string (a short string conforming
> to particular constraints). There's a need for *another* more atomic
> mechanism for examining the internals of the grapheme cluster string.

I did mention this.

> 2) The way an arbitrary string is divided into units when iterating
> over it changes when the program is executed on a newer version of the
> language runtime that is aware of newly assigned code points from a
> newer version of Unicode.

Good point. I did mention the EGC definitions changing, but should point
out that if you have a string with unassigned characters in it, they may be
clustered differently on future versions. Will add.

> * The Python 3.3 model mentions the disadvantages of memory usage
> cliffs but doesn't mention the associated performance cliffs. It would
> be good to also mention that when a string manipulation causes the
> storage to expand or contract, there's a performance impact that's not
> apparent from the nature of the operation if the programmer's
> intuition works on the assumption that the programmer is dealing with
> UTF-32.

The focus was on immutable string models, but I didn't make that clear.
Added some text.

> * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
> optionally, HotSpot
> (https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A).
> That is, text has UTF-16 semantics, but if the high half of every code
> unit in a string is zero, only the lower half is stored. This has
> properties analogous to the Python 3.3 model, except non-BMP text doesn't
> expand to UTF-32 but uses UTF-16 surrogate pairs.

Thanks, will add.

> * I think the fact that systems that chose UTF-16 or UTF-32 have
> implemented models that try to save storage by omitting leading zeros,
> gaining complexity and performance cliffs as a result, is a strong
> indication that UTF-8 should be recommended for newly designed systems
> that don't suffer from a forceful legacy need to expose UTF-16 or
> UTF-32 semantics.
>
> * I suggest splitting the "UTF-8 model" into three substantially
> different models:
>
> 1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in the input survive into the output.
>
> 2) UTF-8 without full trust in the ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust the ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
>
> 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers.
> To go from
> a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely, so
> iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.

Added a quote based on this; please check whether it is OK.

> * After working with different string models, I'd recommend the Rust
> model for newly designed programming languages. (Not because I work
> for Mozilla but because I believe Rust's way of dealing with Unicode
> is the best I've seen.) Rust's standard library provides Unicode
> version-independent iteration over strings: by code unit and by code
> point. Iteration by extended grapheme cluster is provided by a library
> that's easy to include thanks to the nature of Rust package management
> (https://crates.io/crates/unicode_segmentation). Viewing a UTF-8
> buffer as a read-only byte buffer has zero run-time cost and allows
> for maximally fast guaranteed-valid-UTF-8 output.
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
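[For readers following along: the type-system-tagged model and the "replace malformed sequences on ingestion" model described above can be sketched with Rust's standard library alone. This is a minimal illustration, not code from the paper or the thread; grapheme iteration is omitted because it lives in the external unicode_segmentation crate.]

```rust
use std::str;

fn main() {
    // Untyped bytes: "abc" followed by a lone 0xFF (malformed UTF-8).
    let bytes: &[u8] = &[0x61, 0x62, 0x63, 0xFF];

    // Model 3: validity is checked once, at the type boundary.
    // A &str can only be obtained if the bytes are valid UTF-8.
    assert!(str::from_utf8(bytes).is_err());

    // Model 2's ingestion step: malformed sequences are replaced
    // with U+FFFD REPLACEMENT CHARACTER, yielding valid UTF-8.
    let cleaned = String::from_utf8_lossy(bytes);
    assert_eq!(cleaned, "abc\u{FFFD}");

    // Once the data carries the string type, iteration by code
    // point needs no "else" branches for malformed sequences.
    let s: &str = &cleaned;
    assert_eq!(s.chars().count(), 4);

    // Viewing the UTF-8 buffer as read-only bytes is zero-cost.
    let view: &[u8] = s.as_bytes();
    assert_eq!(view.len(), 6); // U+FFFD encodes as three UTF-8 bytes
}
```

Note how `str::from_utf8` returning a `Result` is exactly the "validity checked at the boundary" step, while `from_utf8_lossy` is the REPLACEMENT CHARACTER ingestion step.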