23.01.2019, 16:55, "Edward Welbourne" <edward.welbou...@qt.io>:
> All of this discussion ignores a major elephant: QString's indexing is
> by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode
> for a couple of decades now.
>
> We *should* have a string type (I don't care what you call it) that acts
> on strings indexed by Unicode characters, not in terms of a
> representation. Whether that string type internally uses UTF-16 or
> UTF-8 should be invisible to its user. Ideally it would be capable of
> carrying its data internally in either form (so as to avoid needless
> conversion when both producer and consumer use the same form) and of
> converting between the two (e.g. so as to append efficiently) as needed.

I think this is excessive. The most common operations on strings in
application code are:

* Pass the string around or compare it as an opaque token
* Draw the string on screen, e.g. with QPainter (while technically it
  falls into the previous category, I think it's important enough to
  deserve a separate item)
* Find a substring or pattern (regex) inside the string
* Split the string by character, pattern, or index boundaries found by
  means of the previous item

I think the only common cases where dealing with Unicode grapheme
clusters is required are:

* Handling of text cursor movement
* Implementation of text shaping, i.e. what HarfBuzz is doing

I think having a special iterator would be quite enough for the cursor
case. Such an iterator could abstract away the underlying encoding,
instead of forcing everyone to convert to UTF-16 first.

>
> Meanwhile, buffers of data (whether 8-bit, 16-bit or of other sizes) are
> types we do need in diverse places - but they should be described
> differently from the string type (call it a "text" type, if hysterical
> reasons oblige us to use "string" for its encoding). They can be
> interpreted as strings, hence can serve as backing-store for a string,
> provided they respect the relevant rules of a relevant encoding.
>
> If blob[index] always returns a Unicode *character*, then blob is a
> string; if it can sometimes return one half of a UTF-16 surrogate pair
> (as is the case with QString today) or one byte of a multi-byte UTF-8
> chunk, then blob is not really a string, it's just the storage for an
> encoding of a string.
>
> What are our chances of getting this right in Qt 6?
> It's the 21st century - way past time we did this,
>
> Eddy.
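
To make the iterator idea a bit more concrete, here is a minimal sketch
in plain C++17 with no Qt types. The names nextCodePoint() and
codePoints() are hypothetical helpers invented for this example, not
existing Qt API. The sketch only steps over Unicode code points, which
is one level below grapheme clusters, and the UTF-8 branch assumes
well-formed input. It is meant to show that the caller's loop can be
identical for both encodings, and nothing more.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not Qt API): decode the code point that starts at
// code unit i of a UTF-16 view and advance i past it.
char32_t nextCodePoint(std::u16string_view s, std::size_t &i)
{
    const char16_t hi = s[i++];
    if (hi >= 0xD800 && hi <= 0xDBFF && i < s.size()) {
        const char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i; // consume the low half of the surrogate pair
            return 0x10000 + ((char32_t(hi) - 0xD800) << 10)
                           + (char32_t(lo) - 0xDC00);
        }
    }
    return hi; // BMP character, or an unpaired surrogate passed through
}

// Same contract for UTF-8. Assumption: the input is well-formed UTF-8;
// a real implementation would have to validate it.
char32_t nextCodePoint(std::string_view s, std::size_t &i)
{
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80)
        return lead; // single-byte (ASCII) sequence
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra); // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// The calling code does not care which encoding backs the view.
template <typename View>
std::vector<char32_t> codePoints(View s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();)
        out.push_back(nextCodePoint(s, i));
    return out;
}
```

For example, calling codePoints() on std::u16string_view(u"a\U0001F600")
and on std::string_view("a\xF0\x9F\x98\x80") yields the same two code
points, although the UTF-16 view holds three code units and the UTF-8
view holds five bytes. Real cursor movement would additionally need
grapheme cluster segmentation on top of this, which in Qt is what
QTextBoundaryFinder is for.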

-- 
Regards,
Konstantin
_______________________________________________
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development