Re: [Development] Qt6: Adding UTF-8 storage support to QString
> -----Original Message-----
> From: Jason H
> Sent: Friday, 25 January 2019 17:40
> Cc: development@qt-project.org
> Subject: Re: [Development] Qt6: Adding UTF-8 storage support to QString
>
> > By all means, let's make sure the internals are efficient for the more
> > common languages and scripts; but it's way past time to start doing
> > Unicode properly, so that all cultures are well-served by default,
> > when the software folk are using is built on Qt,
>
> I don't think anyone knows what "properly" is.

+1

> But the more I think about it, the more I like the idea I expressed as a list
> of sequences of various character sizes.
> I think it is a good balance between space and efficiency.

It looks like the proposed boost::text::unencoded_rope to me, except that they chose to implement it as a tree of strings.
https://github.com/boostcon/cppnow_presentations_2018/blob/master/05-07-2018_monday/boost_text_fixing_std_string_and_adding_unicode_to_standard_cpp__zach_laine__cppnow_2018__05072018.pdf

It makes more sense to me if you consider that efficiently editing large strings is not so common.
___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Friday, 25 January 2019 13:39:49 PST Konstantin Tokarev wrote:
> > All living languages are supposed to be stored in the BMP, which means no
> > UTF-16 surrogate pairs to encode them.
>
> AFAIK all emojis are encoded with surrogate pairs

Emojis are not part of a living language. They're drawings. But yes, they're outside the BMP.

In any case, they're often represented by more than one codepoint anyway, so whether we used N*2 UTF-16 code units to represent them or N UTF-32 code units, it makes no difference. Your code needs to know how to deal with them, where to properly break, how to combine them, how to calculate the width, etc.

Also note how they'd be represented by N*4 bytes in UTF-8, which means all three representations take exactly the same amount of memory.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
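To make the size comparison concrete, here is a minimal sketch that computes how many bytes one code point occupies in each encoding form. The helper names (`utf8Bytes`, `utf16Bytes`, `utf32Bytes`) are made up for this illustration and are not Qt API; the thresholds follow the UTF-8 and UTF-16 encoding rules.

```cpp
#include <cstddef>

// Hypothetical helpers (not Qt API): bytes needed to store one Unicode
// code point in each encoding form.
constexpr std::size_t utf8Bytes(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

constexpr std::size_t utf16Bytes(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;   // one code unit, or a surrogate pair
}

constexpr std::size_t utf32Bytes(char32_t)
{
    return 4;                      // always one 32-bit code unit
}

// U+1F600 lies outside the BMP, so it costs 4 bytes in all three forms.
static_assert(utf8Bytes(0x1F600) == 4, "4-byte UTF-8 sequence");
static_assert(utf16Bytes(0x1F600) == 4, "surrogate pair");
static_assert(utf32Bytes(0x1F600) == 4, "one UTF-32 unit");
```

The equality holds for code points at or above U+10000. BMP code points that take part in emoji sequences, such as U+200D ZERO WIDTH JOINER, take 3 bytes in UTF-8 and 2 bytes in UTF-16, so real sequences can differ slightly between the forms.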
Re: [Development] Qt6: Adding UTF-8 storage support to QString
25.01.2019, 23:33, "Thiago Macieira":
> On Friday, 25 January 2019 04:54:22 PST Edward Welbourne wrote:
>> we
>> fail to properly support cultures whose scripts are relegated to the
>> outer planes of Unicode - as, for example, the Chakma language's number
>> system
>
> All living languages are supposed to be stored in the BMP, which means no
> UTF-16 surrogate pairs to encode them.

AFAIK all emojis are encoded with surrogate pairs

>
> That doesn't mean a single code unit, mind you. Think of combining characters.
>
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
>   Software Architect - Intel Open Source Technology Center

--
Regards,
Konstantin
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Friday, 25 January 2019 04:54:22 PST Edward Welbourne wrote:
> we
> fail to properly support cultures whose scripts are relegated to the
> outer planes of Unicode - as, for example, the Chakma language's number
> system

All living languages are supposed to be stored in the BMP, which means no UTF-16 surrogate pairs to encode them.

That doesn't mean a single code unit, mind you. Think of combining characters.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Friday, 25 January 2019 08:54:38 PST Konstantin Tokarev wrote:
> > How often do you need that, outside of QString itself? And maybe a few
> > efficient QtCore classes? (QCborValue comes to mind)
>
> Each time I need to interact efficiently with external code which isn't
> Qt-based, e.g. WebKit, ICU. In particular, this extra copy would certainly
> degrade performance of QtWebKit.
>
> Oh and you've mentioned CBOR, this implies that it won't be possible for Qt
> users to make efficient implementation of a different serialization format.

I didn't say we shouldn't have it. I was just trying to gather information about the need. So it looks like we do need it, if we ever change the encoding.

My worry is that people will fail to handle the combinations properly. Which is why I dislike different encodings even more than changing it wholesale with an API-breaking change.

However, one of my pending Qt 6 changes is to store a flag in QString that says "this UTF-16 string is known to contain only US-ASCII characters". That way, toUtf8() can use the faster toLatin1() algorithm (the flag is set by toUtf8() and toLatin1() the first time they're called). The problem is that it needs to clear that flag in all detach() calls.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
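The ASCII flag idea can be sketched outside Qt as follows. This is only an illustration under stated assumptions: `isAscii`, `appendUtf8` and `toUtf8` are hypothetical free functions, the `knownAscii` parameter stands in for the flag that would live inside QString, and none of this is Qt's actual implementation.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch, not Qt's implementation.
// Scans once to find out whether every UTF-16 code unit is US-ASCII.
bool isAscii(std::u16string_view s)
{
    for (char16_t u : s)
        if (u > 0x7F)
            return false;
    return true;
}

// Appends the UTF-8 encoding of one code point to out.
void appendUtf8(std::string &out, char32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// knownAscii plays the role of the cached flag described above.
std::string toUtf8(std::u16string_view s, bool knownAscii)
{
    std::string out;
    out.reserve(s.size());
    if (knownAscii) {
        // Fast path: every code unit is below 0x80, so narrowing is enough.
        for (char16_t u : s)
            out.push_back(static_cast<char>(u));
        return out;
    }
    // General path: combine surrogate pairs, then emit 1 to 4 bytes.
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t cp = s[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate: substitute U+FFFD
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

A caller would compute `isAscii()` once and reuse the result for as long as the string is not modified, which is why the flag has to be cleared whenever the data may change.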
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On 25/01/19 10:49, Dominik Haumann wrote:
> Sidenote: Such a QStringIterator would also be helpful for KTextEditor, where we likely have some bugs we usually never see since we never have > UTF16 or composed characters.

I managed to merge it into QtCore some 5 years ago; it comes with docs and tests:

https://codereview.qt-project.org/#/c/77136/

You can use it today:

CONFIG += core-private
#include

It's still missing a couple of bits and bolts to turn it public -- most notably, range-based for / STL loop support.

I'd also like to investigate more how it overlaps with SG16 / Boost.Text / etc. efforts before publishing the current API.

My 2 c,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
Re: [Development] Qt6: Adding UTF-8 storage support to QString
25.01.2019, 01:02, "Thiago Macieira":
> On Thursday, 24 January 2019 05:06:58 PST Konstantin Tokarev wrote:
>> I will be officially pissed off if possibility to access raw data of QString
>> without extra copy is gone :( It would be better if there is a way to figure
>> out internal storage encoding (e.g. isUtf16()) and access raw data
>
> How often do you need that, outside of QString itself? And maybe a few
> efficient QtCore classes? (QCborValue comes to mind)

Each time I need to interact efficiently with external code which isn't Qt-based, e.g. WebKit, ICU. In particular, this extra copy would certainly degrade performance of QtWebKit.

Oh and you've mentioned CBOR, this implies that it won't be possible for Qt users to make efficient implementation of a different serialization format.

--
Regards,
Konstantin
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> By all means, let's make sure the internals are efficient for the more
> common languages and scripts; but it's way past time to start doing
> Unicode properly, so that all cultures are well-served by default, when
> the software folk are using is built on Qt,

I don't think anyone knows what "properly" is.

But the more I think about it, the more I like the idea I expressed as a list of sequences of various character sizes. I think it is a good balance between space and efficiency.

To recap that: a class that stores a list of lists of same-width characters. For the most naive case the list is 1 list long and contains only 8-bit characters. This performs identically to QByteArray. Non-ASCII languages requiring 16-bit storage are as QStrings are now. Then, in the more complicated scenarios, it breaks out 8-bit segments and 16-bit segments and makes them appear contiguous (emoji in ASCII text). Of course there could be functions to collapse it all to the uniform largest used width (maximize()) or break it apart to minimize() space (for very long 8-bit strings with occasional characters), and there can even be a bestFit() heuristic. And as always you can get it serialized as UTF-8 or 16...

All the above extends to 32-bit as well.

I think this blend handles the average case very well (all characters of the same width) and has reasonable cost for occasional exotic characters.
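A rough sketch of what such a segmented string could look like, under loud assumptions: `Segment`, `SegmentedString`, `length`, `at` and `maximize` are names invented here (only maximize() is mentioned in the message), 8-bit runs are assumed to hold ASCII only, and 16-bit runs are assumed to hold BMP characters without surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Hypothetical types: each segment is one run of same-width characters.
// Assumption: 8-bit runs hold ASCII only, 16-bit runs hold no surrogates.
using Segment = std::variant<std::string, std::u16string, std::u32string>;
using SegmentedString = std::vector<Segment>;

std::size_t segmentLength(const Segment &seg)
{
    return std::visit([](const auto &run) { return run.size(); }, seg);
}

// Total number of characters across all segments.
std::size_t length(const SegmentedString &s)
{
    std::size_t n = 0;
    for (const Segment &seg : s)
        n += segmentLength(seg);
    return n;
}

// Makes the segments appear contiguous: index counts characters, not bytes.
char32_t at(const SegmentedString &s, std::size_t index)
{
    for (const Segment &seg : s) {
        const std::size_t n = segmentLength(seg);
        if (index < n) {
            return std::visit(
                [index](const auto &run) { return static_cast<char32_t>(run[index]); },
                seg);
        }
        index -= n;
    }
    throw std::out_of_range("index past end of string");
}

// maximize(): collapse everything to the widest representation.
std::u32string maximize(const SegmentedString &s)
{
    std::u32string out;
    for (const Segment &seg : s) {
        std::visit([&out](const auto &run) {
            for (auto c : run)
                out.push_back(static_cast<char32_t>(c));
        }, seg);
    }
    return out;
}
```

In this sketch `at()` walks the segment list, so indexing is linear in the number of segments. That is cheap in the single-segment common case and grows with fragmentation, which is the trade-off a bestFit() heuristic would have to weigh.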
Re: [Development] Qt6: Adding UTF-8 storage support to QString
Arnaud Clère (25 January 2019 10:59) wrote:
> Most user code I have written or seen handles text data naively and is
> incorrect in some respect but I think only a minority of it is leading
> to real problems because input data will rarely trigger them.

That depends a lot on who's supplying your data. The same rationale was given for "making do" with old 8-bit encodings, which meant programs worked for various rich nations' primary languages and didn't for anyone else's. Then we switched to UTF-16, which let us continue not thinking about what we're really doing, while reaching a larger slice of the world. Still, that leaves us complicit in suppressing various minority cultures by making software that works for the dominant culture around them, but not for them.

Until we get into the habit of thinking of text properly (and I still don't even know the terminology, so I have a way to go on this, just like anyone) instead of as a sequence of evenly-sized units, we're going to continue either being inefficient (because we use units that are bigger than needed for many use-cases - arguably true of UTF-16) or failing to properly support cultures whose scripts are relegated to the outer planes of Unicode - as, for example, the Chakma language's number system, which QLocale currently can't represent (QTBUG-69324) because the digits don't fit in a single UTF-16 unit (as QLocaleData expects of digits, signs and quotes, though it understands most of its other locale-specific texts might be longer). As a result, we can't support any Chakma locale.

By all means, let's make sure the internals are efficient for the more common languages and scripts; but it's way past time to start doing Unicode properly, so that all cultures are well-served by default, when the software folk are using is built on Qt,

Eddy.
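To see why a Chakma digit cannot fit in a single UTF-16 unit, here is a small sketch of the UTF-16 encoding rule. `encodeUtf16` is a hypothetical helper written for this note, not a Qt function; U+11136 is CHAKMA DIGIT ZERO.

```cpp
#include <string>

// Hypothetical helper (not Qt API): encodes one code point as UTF-16.
// Assumes cp is a valid Unicode scalar value (not itself a surrogate).
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // BMP: a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return out;
}
```

With this rule U+11136 becomes the pair 0xD804 0xDD36, whereas an ASCII digit such as '7' stays a single unit, so any table that reserves one 16-bit slot per digit has no room for it.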
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> -----Original Message-----
> From: Thiago Macieira
>
> But we WILL NOT change from UTF-16 in the next 2 years.

From a user standpoint, this seems perfectly OK to me.

I do not buy the argument that if switching QString to utf8 makes developer bugs appear sooner, this is a good thing.

Most user code I have written or seen handles text data naively and is incorrect in some respect but I think only a minority of it is leading to real problems because input data will rarely trigger them.

Although not perfect, using 16-bit "characters" for QString and the Windows API is a good approximation that helped a lot to make user code more robust without requiring an understanding of charsets and encodings. At least, it saved me a lot of time if I remember correctly the kind of bugs I was dealing with in the 90's.

So, IMHO, accessing QString content in utf8 "character" units should remain an explicit choice, not the default one. Even choosing utf8 internally for QString for performance reasons seems dubious to me, at least for a good half of the world...
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Thu, Jan 24, 2019 at 10:57 PM Thiago Macieira wrote:
>
> On Wednesday, 23 January 2019 23:32:28 PST Olivier Goffart wrote:
> > - Introduce some iterator that iterates over unicode code points.
>
> I wrote that about a decade ago. It's called QStringIterator and it's inside
> our sources, but in a private header.
>
> But we may want to make it iterate over grapheme clusters instead of Unicode
> codepoints. That is, make it use QTextBoundaryFinder to iterate, instead of
> decoding the storage to UTF-32.
> [...]

Sidenote: Such a QStringIterator would also be helpful for KTextEditor, where we likely have some bugs we usually never see since we never have > UTF16 or composed characters.

Greetings
Dominik
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Thursday, 24 January 2019 05:06:58 PST Konstantin Tokarev wrote:
> I will be officially pissed off if possibility to access raw data of QString
> without extra copy is gone :( It would be better if there is a way to figure
> out internal storage encoding (e.g. isUtf16()) and access raw data

How often do you need that, outside of QString itself? And maybe a few efficient QtCore classes? (QCborValue comes to mind)

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Wednesday, 23 January 2019 23:32:28 PST Olivier Goffart wrote:
> - Introduce some iterator that iterates over unicode code points.

I wrote that about a decade ago. It's called QStringIterator and it's inside our sources, but in a private header.

But we may want to make it iterate over grapheme clusters instead of Unicode codepoints. That is, make it use QTextBoundaryFinder to iterate, instead of decoding the storage to UTF-32.

> - Deprecate utf16() and other API that assume that QString is UTF-16
> - Replace them by a toUtf16 which returns a QVector. I believe
> that it is possible to make the content implicitly shared with the QString,
> avoiding copies. (since it is just a QTypedArrayData internally)

QVector.

Sharing QVector and QString is possible, but we need to fix a few discrepancies, especially that of QVector not being allowed to be raw data, while QString can be (QVector::fromRawData was proposed for Qt 5.0 [Andreas Hartmetz, if I'm not mistaken] but we never added it).

So this is fixable for Qt 6, but not before Qt 6. I think I even tried it in my branch and ran into a lot of trouble. It was a non-obvious change. So I abandoned it.

Still, we're not going to switch away from UTF-16 in Qt 6. The best we can do is pave the way for switching in Qt 7, if we add the methods you're talking about and change ALL the Windows, Cocoa and Android code that calls .data() and assumes it to be UTF-16 to call toUtf16() instead. We may want to have some #defines like the QStringView string level or the ASCII-cast ones, so we catch those.

But we WILL NOT change from UTF-16 in the next 2 years.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
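For readers unfamiliar with what iterating "over Unicode code points" means on UTF-16 storage, here is a minimal standard C++ sketch. It is not QStringIterator and does not reproduce its API: `nextCodePoint` and `codePointCount` are hypothetical names, and the replacement of unpaired surrogates by U+FFFD is one possible policy chosen for this example.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper, not QStringIterator: decodes the code point that
// starts at s[pos] and advances pos past it. Precondition: pos < s.size().
char32_t nextCodePoint(std::u16string_view s, std::size_t &pos)
{
    const char16_t u = s[pos++];
    const bool high = u >= 0xD800 && u <= 0xDBFF;
    if (high && pos < s.size() && s[pos] >= 0xDC00 && s[pos] <= 0xDFFF) {
        // Valid surrogate pair: combine both code units into one code point.
        const char16_t low = s[pos++];
        return 0x10000 + ((char32_t(u) - 0xD800) << 10) + (low - 0xDC00);
    }
    if (u >= 0xD800 && u <= 0xDFFF)
        return 0xFFFD;  // unpaired surrogate, replaced in this sketch
    return u;
}

// Number of code points, as opposed to s.size(), which counts code units.
std::size_t codePointCount(std::u16string_view s)
{
    std::size_t n = 0;
    for (std::size_t pos = 0; pos < s.size(); ++n)
        nextCodePoint(s, pos);
    return n;
}
```

Iterating by grapheme cluster, as suggested above, needs more than this: cluster boundaries depend on Unicode segmentation data, which is what a boundary finder provides and what a plain decoder like this one cannot know.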
Re: [Development] Qt6: Adding UTF-8 storage support to QString
24.01.2019, 10:34, "Olivier Goffart":
> On 23.01.19 23:15, André Pönitz wrote:
>> On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
>>> 23.01.2019, 16:55, "Edward Welbourne":
>>>> All of this discussion ignores a major elephant: QString's indexing is by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode for a couple of decades now.
>>>>
>>>> We *should* have a string type (I don't care what you call it) that acts on strings indexed by Unicode characters, not in terms of a representation. Whether that string type internally uses UTF-16 or UTF-8 should be invisible to its user. Ideally it would be capable of carrying its data internally in either form (so as to avoid needless conversion when both producer and consumer use the same form) and of converting between the two (e.g. so as to append efficiently) as needed.
>>>
>>> I think this is excessive. Most common operations with strings in application code are:
>>>
>>> * Pass the string around or compare as an opaque token
>>> * Draw the string on screen e.g. with QPainter (while technically it falls in the previous category, I think it's important enough to deserve separate item)
>>> * Find substring or pattern (regex) inside the string
>>> * Split the string by character, pattern, or index boundaries found by means of previous item
>>>
>>> I think the only common cases when dealing with Unicode grapheme clusters is required are
>>>
>>> * Handling of text cursor movement
>>> * Implementation of text shaping, i.e. what Harfbuzz is doing
>>>
>>> I think having special iterator would be quite enough for cursor case. Such iterator could abstract away underlying encoding, instead of forcing everyone to convert to UTF-16 first.
>>
>> All of that is scarily close to my opinion on the topic.
>
> Same here. I think Konstantin is spot on.
>
> Another example of good string design, I think, is Rust's String. Their string is encoded in valid UTF-8, indexed by bytes, and splitting the string in the middle of a code point is a programmer error.
>
> As already mentioned before, UTF-16 is quite a bad choice, if it weren't for legacy.
>
> The argument that a developer wrongly using indexes causes more problems with UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually means that the developer will see and fix their bugs quickly.
>
> I understand changing QString to UTF-8 is a difficult task if we want to do it in a compatible way. However, I think there is a way:
> In Qt5.x:
> - Introduce some iterator that iterates over unicode code points.
> - Deprecate utf16() and other API that assume that QString is UTF-16
> - Replace them by a toUtf16 which returns a QVector. I believe that it is possible to make the content implicitly shared with the QString, avoiding copies. (since it is just a QTypedArrayData internally)

I will be officially pissed off if possibility to access raw data of QString without extra copy is gone :(

It would be better if there is a way to figure out internal storage encoding (e.g. isUtf16()) and access raw data

>
> Then in Qt6 one can simply change the representation without breaking compatibility with non-deprecated functions.
>
> --
> Olivier
>
> Woboq - Qt services and support - https://woboq.com - https://code.woboq.org

--
Regards,
Konstantin
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> - Introduce some iterator that iterates over unicode code points.

QStringIterator

> We *should* have a string type (I don't care what you call it) that acts
> on strings indexed by Unicode characters, not in terms of a
> representation. Whether that string type internally uses UTF-16 or
> UTF-8 should be invisible to its user. Ideally it would be capable of
> carrying its data internally in either form (so as to avoid needless
> conversion when both producer and consumer use the same form) and of
> converting between the two (e.g. so as to append efficiently) as needed.

That's what I'd support with both hands. However, I don't think we could do that on QString without breaking most of the existing code.

P.S.
\note Unicode operates on "code points" not "characters". And moreover, there is no such thing as a "glyph" in a Unicode string. And looking for a grapheme or glyph boundary is clearly not a string storage's or a string view's responsibility.

Regards,
Konstantin

Thu, 24 Jan 2019 at 10:33, Olivier Goffart wrote:
> On 23.01.19 23:15, André Pönitz wrote:
> > On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
> >> 23.01.2019, 16:55, "Edward Welbourne":
> >>> All of this discussion ignores a major elephant: QString's indexing is by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode for a couple of decades now.
> >>>
> >>> We *should* have a string type (I don't care what you call it) that acts on strings indexed by Unicode characters, not in terms of a representation. Whether that string type internally uses UTF-16 or UTF-8 should be invisible to its user. Ideally it would be capable of carrying its data internally in either form (so as to avoid needless conversion when both producer and consumer use the same form) and of converting between the two (e.g. so as to append efficiently) as needed.
> >>
> >> I think this is excessive. Most common operations with strings in application code are:
> >>
> >> * Pass the string around or compare as an opaque token
> >> * Draw the string on screen e.g. with QPainter (while technically it falls in the previous category, I think it's important enough to deserve separate item)
> >> * Find substring or pattern (regex) inside the string
> >> * Split the string by character, pattern, or index boundaries found by means of previous item
> >>
> >> I think the only common cases when dealing with Unicode grapheme clusters is required are
> >>
> >> * Handling of text cursor movement
> >> * Implementation of text shaping, i.e. what Harfbuzz is doing
> >>
> >> I think having special iterator would be quite enough for cursor case. Such iterator could abstract away underlying encoding, instead of forcing everyone to convert to UTF-16 first.
> >
> > All of that is scarily close to my opinion on the topic.
>
> Same here. I think Konstantin is spot on.
>
> Another example of good string design, I think, is Rust's String. Their string is encoded in valid UTF-8, indexed by bytes, and splitting the string in the middle of a code point is a programmer error.
>
> As already mentioned before, UTF-16 is quite a bad choice, if it weren't for legacy.
>
> The argument that a developer wrongly using indexes causes more problems with UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually means that the developer will see and fix their bugs quickly.
>
> I understand changing QString to UTF-8 is a difficult task if we want to do it in a compatible way. However, I think there is a way:
> In Qt5.x:
> - Introduce some iterator that iterates over unicode code points.
> - Deprecate utf16() and other API that assume that QString is UTF-16
> - Replace them by a toUtf16 which returns a QVector. I believe that it is possible to make the content implicitly shared with the QString, avoiding copies. (since it is just a QTypedArrayData internally)
>
> Then in Qt6 one can simply change the representation without breaking compatibility with non-deprecated functions.
>
> --
> Olivier
>
> Woboq - Qt services and support - https://woboq.com - https://code.woboq.org
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On 23.01.19 23:15, André Pönitz wrote:
> On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
>> 23.01.2019, 16:55, "Edward Welbourne":
>>> All of this discussion ignores a major elephant: QString's indexing is by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode for a couple of decades now.
>>>
>>> We *should* have a string type (I don't care what you call it) that acts on strings indexed by Unicode characters, not in terms of a representation. Whether that string type internally uses UTF-16 or UTF-8 should be invisible to its user. Ideally it would be capable of carrying its data internally in either form (so as to avoid needless conversion when both producer and consumer use the same form) and of converting between the two (e.g. so as to append efficiently) as needed.
>>
>> I think this is excessive. Most common operations with strings in application code are:
>>
>> * Pass the string around or compare as an opaque token
>> * Draw the string on screen e.g. with QPainter (while technically it falls in the previous category, I think it's important enough to deserve separate item)
>> * Find substring or pattern (regex) inside the string
>> * Split the string by character, pattern, or index boundaries found by means of previous item
>>
>> I think the only common cases when dealing with Unicode grapheme clusters is required are
>>
>> * Handling of text cursor movement
>> * Implementation of text shaping, i.e. what Harfbuzz is doing
>>
>> I think having special iterator would be quite enough for cursor case. Such iterator could abstract away underlying encoding, instead of forcing everyone to convert to UTF-16 first.
>
> All of that is scarily close to my opinion on the topic.

Same here. I think Konstantin is spot on.

Another example of good string design, I think, is Rust's String. Their string is encoded in valid UTF-8, indexed by bytes, and splitting the string in the middle of a code point is a programmer error.

As already mentioned before, UTF-16 is quite a bad choice, if it weren't for legacy.

The argument that a developer wrongly using indexes causes more problems with UTF-8 than with UTF-16 ("it would happen for a lot more characters") actually means that the developer will see and fix their bugs quickly.

I understand changing QString to UTF-8 is a difficult task if we want to do it in a compatible way. However, I think there is a way:
In Qt5.x:
 - Introduce some iterator that iterates over unicode code points.
 - Deprecate utf16() and other API that assume that QString is UTF-16
 - Replace them by a toUtf16 which returns a QVector. I believe that it is possible to make the content implicitly shared with the QString, avoiding copies. (since it is just a QTypedArrayData internally)

Then in Qt6 one can simply change the representation without breaking compatibility with non-deprecated functions.

--
Olivier

Woboq - Qt services and support - https://woboq.com - https://code.woboq.org
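The Rust behaviour mentioned above (byte indexing, with a split inside a code point treated as a programmer error) can be approximated in C++ as follows. This is a sketch modelled on Rust's `str::is_char_boundary` and `str::split_at`; the C++ names `isCharBoundary` and `splitAt` are hypothetical, and the input is assumed to be valid UTF-8 already.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical helper: true if index is the start of a code point or the
// end of the string. Assumes utf8 holds valid UTF-8.
bool isCharBoundary(std::string_view utf8, std::size_t index)
{
    if (index >= utf8.size())
        return index == utf8.size();
    // Continuation bytes have the bit pattern 10xxxxxx.
    return (static_cast<unsigned char>(utf8[index]) & 0xC0) != 0x80;
}

// Splits at a byte index. Splitting inside a code point is treated as a
// programmer error, reported here with assert().
std::pair<std::string_view, std::string_view>
splitAt(std::string_view utf8, std::size_t index)
{
    assert(isCharBoundary(utf8, index));
    return {utf8.substr(0, index), utf8.substr(index)};
}
```

Because the boundary test looks at a single byte, byte-indexed UTF-8 keeps slicing constant-time while still letting debug builds catch an index that lands in the middle of a multi-byte sequence.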
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Wed, Jan 23, 2019 at 05:40:33PM +0300, Konstantin Tokarev wrote:
> 23.01.2019, 16:55, "Edward Welbourne":
> > All of this discussion ignores a major elephant: QString's indexing is by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode for a couple of decades now.
> >
> > We *should* have a string type (I don't care what you call it) that acts on strings indexed by Unicode characters, not in terms of a representation. Whether that string type internally uses UTF-16 or UTF-8 should be invisible to its user. Ideally it would be capable of carrying its data internally in either form (so as to avoid needless conversion when both producer and consumer use the same form) and of converting between the two (e.g. so as to append efficiently) as needed.
>
> I think this is excessive. Most common operations with strings in application code are:
>
> * Pass the string around or compare as an opaque token
> * Draw the string on screen e.g. with QPainter (while technically it falls in the previous category, I think it's important enough to deserve separate item)
> * Find substring or pattern (regex) inside the string
> * Split the string by character, pattern, or index boundaries found by means of previous item
>
> I think the only common cases when dealing with Unicode grapheme clusters is required are
>
> * Handling of text cursor movement
> * Implementation of text shaping, i.e. what Harfbuzz is doing
>
> I think having special iterator would be quite enough for cursor case. Such iterator could abstract away underlying encoding, instead of forcing everyone to convert to UTF-16 first.

All of that is scarily close to my opinion on the topic.

Andre'
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Wednesday, 23 January 2019 06:07:37 PST Marco Bubke wrote:
> Would it be not better to use a simple container and then functions on top
> which use a view, so we could use them with any container

If only we had a class that found boundaries in text...

http://doc.qt.io/qt-5/qtextboundaryfinder.html

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
Re: [Development] Qt6: Adding UTF-8 storage support to QString
Marco Bubke (23 January 2019 15:07) wrote
> Would it be not better to use a simple container and then functions on
> top which use a view, so we could use them with any container.

That sounds just fine to me. Indeed, in separating the "Unicode text" nature from its encoding, I'm fine with the *storage* being the encoding and the text being a view of that storage - just as long as we get an API that lets us deal with every form of storage (and encoding) consistently in terms of Unicode, when the code accessing it wants to do that.

Eddy.
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Wednesday, 23 January 2019 07:25:44 PST Jason H wrote:
> > From: "Arnaud Clère"
> >
> > > And I don't want to add QUtf8String until SG16's char8_t gets settled.
> > > It'll probably be settled by C++20, which means we can probably work on
> > > this during Qt 6 lifetime, possibly even 6.1 or 6.2.
> >
> > It makes sense to avoid future incompatibilities with the standard but
> > fortunately Qt sometimes chooses to solve real problems ahead in time
> > ;-)
>
> Well C++20 is really how many months away? Qt6 won't be released until when?

Give me the exact answers and I'll tell you if we can have this in Qt 6.0.

The fact you can't is the problem: they're too much in flux and too close to each other for us to be able to accept char8_t as an established functionality that won't change by a later paper and design a solution for Qt 6.0.

If we're lucky, we can do it. More likely, we'll have to wait a bit, possibly even for a compiler to implement it.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Wednesday, 23 January 2019 05:53:00 PST Edward Welbourne wrote:
> What are our chances of getting this right in Qt 6 ?

Not bad.

But what you described is what SG16 is working on for std::text. So let's not do something different from them. We can prototype it and be first, though.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> From: "Arnaud Clère"
> > And I don't want to add QUtf8String until SG16's char8_t gets settled.
> > It'll probably be settled by C++20, which means we can probably work on
> > this during Qt 6 lifetime, possibly even 6.1 or 6.2.
>
> It makes sense to avoid future incompatibilities with the standard but
> fortunately Qt sometimes chooses to solve real problems ahead in time ;-)

Well C++20 is really how many months away? Qt6 won't be released until when? It seems like both of these might land at the same time, except that the "by C++20" is (AFAICT) speculation. Uptake will also be slow. But by Qt being first we can get experience with the nature of the solution which might help inform the standard, or vice-versa. There's a risk we do something that conflicts with the standard in a useful way that people like, then we have fragmentation.

Far smarter people than I have worked on this, so again burn this with fire, but my current thinking is:

I think the problem is how all these things are implemented - they are basically escape codes, so it's impossible to say where the current character ends and the next begins. This of course kills speed, but that's what we get for having more than one language on the planet plus emojis. It seems to me that the only real solution to keep it all fast is to progressively upgrade from bytes to the widest character and use that. This will have a scanning cost when it enters the address space if not denoted to the compiler or by the load method.

If memory is a concern, the only alternative I see is to create a complex string: "strings" are now arrays of character arrays of uniform width, and hope that it is only ever one:

"Ground control to Major Tom" - single sequence of 8 bit chars, len 27 size 27
"niños." encoded as 3 "strings", total length 6, size 7:
+ "ni" - "ni" (8 bit char sequence of 2 char)
+ "ñ" - 0001 (UTF16 16 bit char sequence of 1 char)
+ "os." - "os." (8 bit char sequence of 3 char)

In the old days, some BASIC (I forget which one; I'm remembering a Dr Dobbs or some other print medium from over 20 years ago) stored strings as a linked list of characters. I'm adapting that idea.

There are many tradeoffs, but until we're ok with 32 bit characters, there will be tradeoffs on a multi-language planet. I just don't think escape codes should ever be stored in memory. Disk is fine.

"Better to remain silent and be thought a fool than to speak and to remove all doubt." - (Disputed). I think I may have broken that rule here.

"Please, be gentle." - Peter Venkman
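For readers who want to see the "array of uniform-width runs" idea in concrete form, here is a minimal sketch in plain standard C++ (no Qt types involved). The names Segment, widthOf, splitByWidth and storageSize are hypothetical, and the sketch assumes, as the "niños." example implies, that the 8-bit runs hold ASCII only. It is an illustration of the proposal, not a recommended design.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical run of characters that all fit the same storage width.
struct Segment {
    int bytesPerChar;       // 1, 2 or 4
    std::u32string chars;   // kept as code points here, for simplicity
};

// Smallest width (in bytes) that holds the code point without escapes.
// Assumption: 8-bit runs are ASCII only, as in the "niños." example.
inline int widthOf(char32_t c) {
    if (c < 0x80) return 1;
    if (c <= 0xFFFF) return 2;
    return 4;
}

// Splits the text into maximal runs of equal character width.
inline std::vector<Segment> splitByWidth(const std::u32string &text) {
    std::vector<Segment> out;
    for (char32_t c : text) {
        const int w = widthOf(c);
        if (out.empty() || out.back().bytesPerChar != w)
            out.push_back(Segment{w, {}});
        out.back().chars.push_back(c);
    }
    return out;
}

// Bytes the runs would occupy, ignoring per-run bookkeeping overhead.
inline std::size_t storageSize(const std::vector<Segment> &segs) {
    std::size_t n = 0;
    for (const Segment &s : segs)
        n += s.chars.size() * static_cast<std::size_t>(s.bytesPerChar);
    return n;
}
```

Calling splitByWidth(U"ni\u00F1os.") gives three runs with a payload of 7 bytes, matching the numbers in the message. The per-run headers that a real container would need are not counted, and they are where much of the tradeoff lies.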
Re: [Development] Qt6: Adding UTF-8 storage support to QString
23.01.2019, 16:55, "Edward Welbourne" :
> All of this discussion ignores a major elephant: QString's indexing is
> by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode
> for a couple of decades now.
>
> We *should* have a string type (I don't care what you call it) that acts
> on strings indexed by Unicode characters, not in terms of a
> representation. Whether that string type internally uses UTF-16 or
> UTF-8 should be invisible to its user. Ideally it would be capable of
> carrying its data internally in either form (so as to avoid needless
> conversion when both producer and consumer use the same form) and of
> converting between the two (e.g. so as to append efficiently) as needed.

I think this is excessive. Most common operations with strings in application code are:

* Pass the string around or compare as an opaque token
* Draw the string on screen e.g. with QPainter (while technically it falls in the previous category, I think it's important enough to deserve a separate item)
* Find substring or pattern (regex) inside the string
* Split the string by character, pattern, or index boundaries found by means of the previous item

I think the only common cases where dealing with Unicode grapheme clusters is required are

* Handling of text cursor movement
* Implementation of text shaping, i.e. what Harfbuzz is doing

I think having a special iterator would be quite enough for the cursor case. Such an iterator could abstract away the underlying encoding, instead of forcing everyone to convert to UTF-16 first.

> Meanwhile, buffers of data (whether 8-bit, 16-bit or of other sizes) are
> types we do need in diverse places - but they should be described
> differently from the string type (call it a "text" type, if hysterical
> reasons oblige us to use "string" for its encoding). They can be
> interpreted as strings, hence can serve as backing-store for a string,
> provided they respect the relevant rules of a relevant encoding.
>
> If blob[index] always returns a Unicode *character*, then blob is a
> string; if it can sometimes return one half of a UTF-16 surrogate pair
> (as is the case with QString today) or one byte of a multi-byte UTF-8
> chunk, then blob is not really a string, it's just the storage for an
> encoding of a string.
>
> What are our chances of getting this right in Qt 6 ?
> It's the 21st century - way past time we did this,
>
> Eddy.

--
Regards,
Konstantin
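As a rough illustration of an iterator-style interface that hides the encoding, the following sketch walks UTF-8 storage one code point at a time using only the standard library. The function names nextCodePoint and codePoints are hypothetical, the decoder assumes well-formed input, and it stops at code points. Cursor movement needs grapheme cluster boundaries, which this sketch does not attempt.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Decodes the code point starting at byte offset `pos` and advances `pos`
// past it. Assumes well-formed UTF-8; real code must validate its input.
inline char32_t nextCodePoint(std::string_view utf8, std::size_t &pos) {
    const auto byteAt = [&](std::size_t i) {
        return static_cast<unsigned char>(utf8[i]);
    };
    const unsigned char lead = byteAt(pos);
    // Number of continuation bytes implied by the lead byte.
    const std::size_t extra =
        lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
    // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
    char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
    for (std::size_t i = 1; i <= extra; ++i)
        cp = (cp << 6) | (byteAt(pos + i) & 0x3Fu);
    pos += 1 + extra;
    return cp;
}

// Convenience wrapper: all code points of a UTF-8 buffer, in order.
inline std::vector<char32_t> codePoints(std::string_view utf8) {
    std::vector<char32_t> out;
    for (std::size_t pos = 0; pos < utf8.size();)
        out.push_back(nextCodePoint(utf8, pos));
    return out;
}
```

A caller that only ever uses nextCodePoint does not care whether the buffer is UTF-8 or UTF-16. A second overload taking std::u16string_view could offer the same contract over the other encoding.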
Re: [Development] Qt6: Adding UTF-8 storage support to QString
I am not sure it would be a good idea, because a glyph can still be composed of more than one code point, which is language dependent. Sometimes you want characters, sometimes code points and sometimes glyphs, etc.

Would it not be better to use a simple container and then functions on top which use a view, so we could use them with any container? That way we would avoid any allocations for transforming characters from one container to the other.

But anyway, I think there are so many usages for strings that one class to tackle all these problems is not enough.

From: Development on behalf of Edward Welbourne
Sent: Wednesday, January 23, 2019 2:53:00 PM
To: Arnaud Clère; Thiago Macieira
Cc: development@qt-project.org
Subject: Re: [Development] Qt6: Adding UTF-8 storage support to QString

All of this discussion ignores a major elephant: QString's indexing is by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode for a couple of decades now.

We *should* have a string type (I don't care what you call it) that acts on strings indexed by Unicode characters, not in terms of a representation. Whether that string type internally uses UTF-16 or UTF-8 should be invisible to its user. Ideally it would be capable of carrying its data internally in either form (so as to avoid needless conversion when both producer and consumer use the same form) and of converting between the two (e.g. so as to append efficiently) as needed.

Meanwhile, buffers of data (whether 8-bit, 16-bit or of other sizes) are types we do need in diverse places - but they should be described differently from the string type (call it a "text" type, if hysterical reasons oblige us to use "string" for its encoding). They can be interpreted as strings, hence can serve as backing-store for a string, provided they respect the relevant rules of a relevant encoding.

If blob[index] always returns a Unicode *character*, then blob is a string; if it can sometimes return one half of a UTF-16 surrogate pair (as is the case with QString today) or one byte of a multi-byte UTF-8 chunk, then blob is not really a string, it's just the storage for an encoding of a string.

What are our chances of getting this right in Qt 6 ?
It's the 21st century - way past time we did this,

Eddy.
Re: [Development] Qt6: Adding UTF-8 storage support to QString
All of this discussion ignores a major elephant: QString's indexing is by 16-bit UTF-16 tokens, not by Unicode characters. We've had Unicode for a couple of decades now.

We *should* have a string type (I don't care what you call it) that acts on strings indexed by Unicode characters, not in terms of a representation. Whether that string type internally uses UTF-16 or UTF-8 should be invisible to its user. Ideally it would be capable of carrying its data internally in either form (so as to avoid needless conversion when both producer and consumer use the same form) and of converting between the two (e.g. so as to append efficiently) as needed.

Meanwhile, buffers of data (whether 8-bit, 16-bit or of other sizes) are types we do need in diverse places - but they should be described differently from the string type (call it a "text" type, if hysterical reasons oblige us to use "string" for its encoding). They can be interpreted as strings, hence can serve as backing-store for a string, provided they respect the relevant rules of a relevant encoding.

If blob[index] always returns a Unicode *character*, then blob is a string; if it can sometimes return one half of a UTF-16 surrogate pair (as is the case with QString today) or one byte of a multi-byte UTF-8 chunk, then blob is not really a string, it's just the storage for an encoding of a string.

What are our chances of getting this right in Qt 6 ?
It's the 21st century - way past time we did this,

Eddy.
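The distinction between "storage for an encoding" and "string of characters" can be seen with nothing more than a standard std::u16string_view, which indexes by 16-bit code unit much as the message describes for QString. This is a small sketch with hypothetical helper names, and it assumes well-formed UTF-16.

```cpp
#include <cstddef>
#include <string_view>

inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points in well-formed UTF-16: every code unit starts a new
// code point except the second (low) half of a surrogate pair.
inline std::size_t codePointCount(std::u16string_view s) {
    std::size_t n = 0;
    for (char16_t u : s)
        if (!isLowSurrogate(u))
            ++n;
    return n;
}
```

For the text u"a\U0001F600b", size() reports 4 code units while codePointCount() reports 3 code points, and indexing at positions 1 and 2 yields the two halves of one surrogate pair rather than a character.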
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> -Original Message-
> From: Thiago Macieira
>
> On Tuesday, 22 January 2019 09:01:16 PST Arnaud Clère wrote:
> > QByteArray is the official way to deal with utf8 strings but:
> > 1. This discussion shows it is not as known as it should be and I
> > argue the name does not help
> > 2. Dealing with binary data and all kind
> > of string encodings in a single class is error-prone
>
> And yet that's what we used to have in Qt 3 (remember QCString?). We went
> away from it for a reason.

Sorry no, I never used Qt3. I just googled it looking for problems and only found ones that should be solved now by QByteArray:
- explicit sharing
- bad performance due to append() being O(length()) since it scans for a null terminator

> And 3: some character-mutating operations in QByteArray (toUpper, etc.) are
> Latin1, not UTF-8.

A QUtf8String could override toUpper() and toLower(), whose Latin1 behaviour is unfortunate if QByteArray really is the official way to deal with utf-8 strings...

> > Hence my suggestion of adding a QUtf8String deriving from QByteArray...
> Not likely to happen. If we add a QUtf8String, it will be like QLatin1String,
> which in turn was meant to be similar to QStringView, not like QString. That
> means no mutation and no owning memory.

The use case I am talking about is really a mutable utf8 container, even though it could provide a QUtf8StringLiteral macro similar to QByteArrayLiteral. I do not understand why a QUtf8String should necessarily be like a QLatin1String.

OTOH, I would love to be able to manipulate QLatin1String/QUtf8String with a QStringView when dealing with possibly non-ASCII content. But QStringView seems to require knowing the number of remaining Unicode characters in constant time, so I guess it is out of the question...

> And I don't want to add QUtf8String until SG16's char8_t gets settled. It'll
> probably be settled by C++20, which means we can probably work on this during
> Qt 6 lifetime, possibly even 6.1 or 6.2.

It makes sense to avoid future incompatibilities with the standard but fortunately Qt sometimes chooses to solve real problems ahead in time ;-)
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Tuesday, 22 January 2019 09:01:16 PST Arnaud Clère wrote: > QByteArray is the official way to deal with utf8 strings but: > 1. This discussion shows it is not as known as it should be and I argue the > name does not help > 2. Dealing with binary data and all kind of string > encodings in a single class is error-prone And yet that's what we used to have in Qt 3 (remember QCString?). We went away from it for a reason. And 3: some character-mutating operations in QByteArray (toUpper, etc.) are Latin1, not UTF-8. > Hence my suggestion of adding a QUtf8String deriving from QByteArray... Not likely to happen. If we add a QUtf8String, it will be like QLatin1String, which in turn was meant to be similar to QStringView, not like QString. That means no mutation and no owning memory. And I don't want to add QUtf8String until SG16's char8_t gets settled. It'll probably be settled by C++20, which means we can probably work on this during Qt 6 lifetime, possibly even 6.1 or 6.2. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Tuesday, 22 January 2019 11:02:22 PST Matthew Woehlke wrote:
> On 18/01/2019 11.09, Thiago Macieira wrote:
> > As for strings, the QString constructor takes UTF-8 input, but however
> > fast
> > the decoder is, it's still slightly slower than the Latin1 decoder. So if
> > your string is purely US-ASCII, using QLatin1String is recommended.
>
> ...but I assume QStringLiteral remains even faster? (I would think so;
> not only is *no* decoding needed, which you could also get just by using
> wide string literals, but also no *allocation*...)

Yes. In terms of CPU cycles, for a given string length of US-ASCII content:

QUtf8::convertToUnicode > qt_from_latin1 > memcpy > ∅
(fromUtf8, fromLatin1, fromUtf16, QStringLiteral)

The empty set symbol indicates that QStringLiteral requires no operation on the content (O(1) on length). The others are O(n).

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
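To make the middle of that ordering tangible, here is a scalar sketch of the two cheaper conversions using standard C++ containers. These are hypothetical stand-ins that merely share the names fromLatin1 and fromUtf16 with the message. They are not Qt's implementations, which the thread indicates are vectorised. The point is only that Latin-1 needs a per-element widening step whereas UTF-16 input is a straight copy.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Latin-1 to UTF-16: every byte is zero-extended to one 16-bit code unit,
// because Latin-1 values coincide with the first 256 Unicode code points.
inline std::u16string fromLatin1(std::string_view latin1) {
    std::u16string out(latin1.size(), u'\0');
    for (std::size_t i = 0; i < latin1.size(); ++i)
        out[i] = static_cast<unsigned char>(latin1[i]);
    return out;
}

// UTF-16 to UTF-16: no per-character work at all, just a block copy.
inline std::u16string fromUtf16(std::u16string_view utf16) {
    std::u16string out(utf16.size(), u'\0');
    std::memcpy(out.data(), utf16.data(), utf16.size() * sizeof(char16_t));
    return out;
}
```

Both functions are O(n) and both allocate. A UTF-8 decoder adds branching on lead bytes on top of the widening, which is consistent with it sitting at the expensive end of the list.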
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On 18/01/2019 11.09, Thiago Macieira wrote: > As for strings, the QString constructor takes UTF-8 input, but however fast > the decoder is, it's still slightly slower than the Latin1 decoder. So if > your > string is purely US-ASCII, using QLatin1String is recommended. ...but I assume QStringLiteral remains even faster? (I would think so; not only is *no* decoding needed, which you could also get just by using wide string literals, but also no *allocation*...) -- Matthew ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> Original Message- > From: Jason H > > > From: "Arnaud Clère" > > > > > -Original Message- > > > From: Allan Sandfeld Jensen > > > > > > Use QByteArray when you can. > > > > I think a QUtf8String class derived from QByteArray would help a lot making > > this happen in the real world! > > Feel free to burn this suggestion with fire, but what about: > > typedef QSymbolSequence QLatin1String; > typedef QSymbolSequence QByteArray; > typedef QSymbolSequence QByteArray; > typedef QSymbolSequence QString; > > So they can have the same API? It really seems to me that the issue is > storage, not that they need a different API to operate on the storage. This is close to QStringView and it would be nice to be able to build one from QByteArray/QUtf8String to access utf8 characters as QChar on the fly. It would avoid most MBCS problems with utf8 strings. Unfortunately, I am afraid this is not possible for QStringView since it must know the number of remaining characters and utf8 requires to decode the whole string to know that. My point was not as ambitious: QByteArray is the official way to deal with utf8 strings but: 1. This discussion shows it is not as known as it should be and I argue the name does not help 2. Dealing with binary data and all kind of string encodings in a single class is error-prone Hence my suggestion of adding a QUtf8String deriving from QByteArray... I have no idea if it would be feasible though ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
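The constraint mentioned here, that the number of characters in UTF-8 data is only known after looking at every byte, can be sketched in a few lines of standard C++. The function name utf8CodePointCount is hypothetical and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Counts code points by counting every byte that is not a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx). This is a
// full O(n) pass, whereas the byte length is available in O(1).
inline std::size_t utf8CodePointCount(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8)
        if ((b & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For "niños." encoded as UTF-8, the view's size() is 7 bytes while utf8CodePointCount() returns 6, and there is no way to obtain the second number without the scan unless it is cached alongside the data.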
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Tuesday, 22 January 2019 06:49:51 PST Jason H wrote: > typedef QSymbolSequence QLatin1String; > typedef QSymbolSequence QByteArray; > typedef QSymbolSequence QByteArray; > typedef QSymbolSequence QString; > > So they can have the same API? It really seems to me that the issue is > storage, not that they need a different API to operate on the storage. That QSymbolSequence template class does not exist and is not easy to implement. Storage is not the problem, it's actually the algorithms that operate on and transform the contents. They'd have to be rewritten for each of the four. Go ahead and give it a try, though. This may also be what SG16 intends for C++23, so it may be an interesting trial run. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> Sent: Monday, January 21, 2019 at 9:51 AM > From: "Arnaud Clère" > To: "Allan Sandfeld Jensen" , "development@qt-project.org" > > Subject: Re: [Development] Qt6: Adding UTF-8 storage support to QString > > > -Original Message- > > From: Allan Sandfeld Jensen > > > > On Dienstag, 15. Januar 2019 19:43:57 CET Cristian Adam wrote: > > > Any chance of having UTF-8 storage support for QString? > > > > > Use QByteArray when you can. > > I think a QUtf8String class derived from QByteArray would help a lot making > this happen in the real world! > 1. It would be found more easily by users in need of a utf8 encoded dynamic > string > 2. It would allow making the encoding explicit (QString or QUtf8String or > QLatin1String) in newer Qt APIs or user-defined ones, and even totally safe > if disabling const char * casts is possible > 3. It would allow adding QString-like APIs (like setNum(), simplified(), > etc.) over the time without cluttering QByteArray > > Moreover, I have a specific use-case where QByteArray args are used as binary > data (say CBOR) and a specific Utf8String is useful to handle utf8 encoded > args without always encoding/decoding to utf16. > I might not be the only one... Feel free to burn this suggestion with fire, but what about: typedef QSymbolSequence QLatin1String; typedef QSymbolSequence QByteArray; typedef QSymbolSequence QByteArray; typedef QSymbolSequence QString; So they can have the same API? It really seems to me that the issue is storage, not that they need a different API to operate on the storage. ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> -Original Message- > From: Allan Sandfeld Jensen > > On Dienstag, 15. Januar 2019 19:43:57 CET Cristian Adam wrote: > > Any chance of having UTF-8 storage support for QString? > > > Use QByteArray when you can. I think a QUtf8String class derived from QByteArray would help a lot making this happen in the real world! 1. It would be found more easily by users in need of a utf8 encoded dynamic string 2. It would allow making the encoding explicit (QString or QUtf8String or QLatin1String) in newer Qt APIs or user-defined ones, and even totally safe if disabling const char * casts is possible 3. It would allow adding QString-like APIs (like setNum(), simplified(), etc.) over the time without cluttering QByteArray Moreover, I have a specific use-case where QByteArray args are used as binary data (say CBOR) and a specific Utf8String is useful to handle utf8 encoded args without always encoding/decoding to utf16. I might not be the only one... ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Friday, 18 January 2019 08:57:19 PST Tor Arne Vestbø wrote:
> > On 18 Jan 2019, at 17:21, Thiago Macieira
> > Actually, what we should do is allow everywhere
> >
> >   functionTakingString(u"Tor Arne Vestbø")
> >   // (note the u)
>
> Yes, this would be awesome! Please let's do this
>
> And I guess without QT_NO_CAST_FROM_ASCII you'd still be able to do:
>
>   functionTakingString("Tor Arne Vestbø") // without the 'u', runtime cost

Right, but given the benefit of char16_t literals, we should encourage QT_NO_CAST_FROM_ASCII even more! It's a single extra letter in your source and even if the compiler is misconfigured and is producing mojibake for your surname, my middle name or Jędrzej's first name, it will still work for US-ASCII content ("a broken clock is right twice a day" type of "work").

In fact, we ought to look into replacing our QLatin1String content with char16_t literals in our sources.

Pros: avoid the Latin1 decoder, which is slower[¹] than a pure memcpy.
Cons: doubles the size of the string.

So I'd use QLatin1String only for uncommonly used strings, where saving a few bytes is worth it.

[¹] see https://analysis.godbolt.org/z/OZ-5Gz, which contains the inner loop of qt_from_latin1_internal (an AVX2 build[²]) and compare to an equivalent memcpy in https://analysis.godbolt.org/z/7vR2jW. Note how the memcpy loop according to llvm-mca has 3 cycles fewer of latency than the latin1 decoder. And this is not an optimal memcpy loop.

[²] Our builds are not AVX2 by default. You're only going to get this performance if you build with -march=native (Gentoo?) or you use Clear Linux. The defaults are much worse.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Friday, 18 January 2019 08:13:40 PST Kai Koehne wrote: > 1. We generally compile Qt code with QT_NO_CAST_FROM_ASCII that disables the > QString(const char *) overload. And we do that so that you have to make it > explicit whether you really want to do the implicit conversion from UTF-8 > to UTF-16, use QStringLiteral() to encode it as UTF-16 at compile time, or > rather have it translated with a tr() call. > > I think for Qt code explicit is better than implicit, so I actually would > stay with QT_NO_CAST_FROM_ASCII. Actually, what we should do is allow everywhere functionTakingString(u"Tor Arne Vestbø") // (note the u) Which causes the compiler to encode the string in UTF-16, bypassing the need for runtime decoding, and enforcing sources as UTF-8, so we get consistent binaries. It's just one step short of QStringLiteral in that it will still allocate memory, but it only needs a memcpy. Such code also works with functions taking QStringView without memory allocation. We all know that QStringLiteral has drawbacks when it comes to unloading modules. For QtCore, obviously QStringLiteral is not a problem, but other modules may decide to avoid it. PS: I still want to improve QStringLiteral, but it will still be different from a pure char16_t literal. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
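The effect of the u prefix can be checked without Qt: the compiler emits UTF-16 code units, and a non-owning view over them needs neither decoding nor allocation. In this sketch codeUnits is a hypothetical stand-in for a function taking a string view, and the ø is written as \u00F8 so the example does not depend on how the source file is encoded.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for an API that accepts a non-owning UTF-16 view.
constexpr std::size_t codeUnits(std::u16string_view s) { return s.size(); }

// The literal is already UTF-16 in the binary; the view only points at it.
constexpr std::u16string_view name = u"Tor Arne Vestb\u00F8";

// Everything is known at compile time: 15 code units, the last one U+00F8.
static_assert(codeUnits(name) == 15, "one code unit per character here");
static_assert(name.back() == u'\u00F8', "no runtime decoding involved");
```

Passing the same literal to a function that takes an owning string would still have to allocate and copy, which is the "one step short of QStringLiteral" case described above.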
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> -Original Message-
> From: Development On Behalf Of Tor Arne Vestbø
> Sent: Friday, January 18, 2019 4:27 PM
> To: Jedrzej Nowacki
> Cc: Thiago Macieira ; development@qt-project.org
> Subject: Re: [Development] Qt6: Adding UTF-8 storage support to QString
>
> Picking up on this:
>
> If we plan to standardise on our Qt source code being UTF8, can we please
> allow QString("Tor Arne Vestbø") without going through
> QLatin1Literal/QStringLiteral/QLatin1String/etc etc?

I think you're touching two different things here:

1. We generally compile Qt code with QT_NO_CAST_FROM_ASCII, which disables the QString(const char *) overload. We do that so that you have to make it explicit whether you really want to do the implicit conversion from UTF-8 to UTF-16, use QStringLiteral() to encode it as UTF-16 at compile time, or rather have it translated with a tr() call.

I think for Qt code explicit is better than implicit, so I actually would stay with QT_NO_CAST_FROM_ASCII.

2. We require all Qt source code to be ASCII only. This is AFAIK mostly because of the editor in Visual Studio, which even in its latest incarnation doesn't have a global option to save files in UTF-8 instead of .

Here I'm not sure anymore whether being conservative buys us much. VS after all has a heuristic to open a UTF-8 encoded file correctly, so the problem mostly is that people might create a new file with non-UTF-8 content, or copy it from another project.

Regards

Kai
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Friday, 18 January 2019 07:26:51 PST Tor Arne Vestbø wrote: > If we plan to standardise on our Qt source code being UTF8, can we please > allow QString(“Tor Arne Vestbø") without going through > QLatin1Literal/QStringLiteral/QLatin1String/etc etc? I think we now can. The last problem we had was MSVC pre-2015 update 2, which added the /utf-8 switch. Without that option, any non-ASCII character in the source code, even in comments, could cause compilation errors by causing a decoding error in whichever codepage the user used in his/her OS. I think all our builds now use /utf-8, which means UTF-8 is permitted everywhere now. You can use it in comments ("Copyright Klarälvdalens ...", for example) and in strings. Please don't use it in identifiers. As for strings, the QString constructor takes UTF-8 input, but however fast the decoder is, it's still slightly slower than the Latin1 decoder. So if your string is purely US-ASCII, using QLatin1String is recommended. PS: we don't need SG16's char8_t, but we'll need to add support for it. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
Picking up on this: If we plan to standardise on our Qt source code being UTF8, can we please allow QString(“Tor Arne Vestbø") without going through QLatin1Literal/QStringLiteral/QLatin1String/etc etc? Tor Arne > On 18 Jan 2019, at 16:01, Jedrzej Nowacki wrote: > > Dnia środa, 16 stycznia 2019 21:12:55 CET André Pönitz pisze: >> On Tue, Jan 15, 2019 at 10:44:45PM +0100, Allan Sandfeld Jensen wrote: >>> On Dienstag, 15. Januar 2019 19:43:57 CET Cristian Adam wrote: Hi, With every Qt release we see how the new release improved over previous releases in terms of speed, memory consumption, etc. Any chance of having UTF-8 storage support for QString? >>> >>> Use QByteArray when you can. >> >> Unfortunately, quite a few APIs require to use QString, even if >> the typically use case would be completely fine even with ASCII, >> like keys in QVariantMap or QSettings. >> >> Andre' > > As a travelling person with name that can not be represented with latin1, I > can tell you some funny stories about systems that authors thought that > "ascii > is enough". Unless you want to keep only hex codes or sha1s, please use > bigger > character set. > > Cheers, > Jędrek > > > ___ > Development mailing list > Development@qt-project.org > https://lists.qt-project.org/listinfo/development ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
Dnia środa, 16 stycznia 2019 21:12:55 CET André Pönitz pisze: > On Tue, Jan 15, 2019 at 10:44:45PM +0100, Allan Sandfeld Jensen wrote: > > On Dienstag, 15. Januar 2019 19:43:57 CET Cristian Adam wrote: > > > Hi, > > > > > > With every Qt release we see how the new release improved over previous > > > releases in terms of speed, memory consumption, etc. > > > > > > Any chance of having UTF-8 storage support for QString? > > > > Use QByteArray when you can. > > Unfortunately, quite a few APIs require to use QString, even if > the typically use case would be completely fine even with ASCII, > like keys in QVariantMap or QSettings. > > Andre' As a travelling person with name that can not be represented with latin1, I can tell you some funny stories about systems that authors thought that "ascii is enough". Unless you want to keep only hex codes or sha1s, please use bigger character set. Cheers, Jędrek ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Thursday, 17 January 2019 13:27:40 PST Martin Koller wrote: > On Mittwoch, 16. Jänner 2019 19:44:27 CET Konstantin Tokarev wrote: > > From QtWebKit perpective it would be great if Qt APIs which require > > QString now would also accept QLatin1String at least for ASCII-only data > is QtWebKit still alive ? > Seems there is nobody working on it since more than a year... Konstantin is the maintainer, but I haven't seen releases recently, so it's not something I could recommend depending on. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel Open Source Technology Center ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Wednesday, 16 January 2019 19:44:27 CET Konstantin Tokarev wrote:
> From QtWebKit perspective it would be great if Qt APIs which require QString
> now would also accept QLatin1String at least for ASCII-only data

Is QtWebKit still alive? It seems nobody has been working on it for more than a year...

--
Best regards
Martin

A: Because it breaks the logical sequence of discussion
Q: Why is top posting bad?

()  ascii ribbon campaign - against html e-mail
/\                        - against proprietary attachments

Gift ideas, accessories, soaps, delicacies: www.lillehus.at
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Wednesday, 16 January 2019 13:16:39 PST Konstantin Tokarev wrote:
> 1. Code points may be encoded as surrogate pairs in UTF-16, e.g. this is the
> case for Emoji characters. QString ignores this fact, indexing 16-bit
> QChars. To make things worse, several QString methods like left(), right(),
> and mid() will happily cut surrogate pair in a half.

So does QByteArray, and so would a UTF-8 based QString, except it would happen for a lot more characters. What you want is QTextBoundaryFinder and possibly QFontMetrics.

> 2. When people are talking about character indexing they often imply
> indexing of grapheme clusters. In Unicode world grapheme cluster may be
> represented as a several code points depending on normalization form of the
> source. To make things worse, even in NFC form not every grapheme cluster
> that is possible in Unicode is representable as a single code point.

Indeed, and SG16 in the C++ Standard is looking into grapheme clusters as the basic unit. Unfortunately, their work does not coincide with our Qt 6 timelines, nor would we be able to adapt that quickly based on how much code there is using QString. We should pay attention to the SG16 work and make sure it works with Qt 6, with eyes towards a better API in Qt 7.

Nowhere did I say that we should use UTF-8.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
Re: [Development] Qt6: Adding UTF-8 storage support to QString
15.01.2019, 23:13, "Alexander Akulich":
> Cristian,
>
> the previous discussion is "Why can't QString use UTF-8 internally?"
> There is something wrong with our maillist, the best link I found is
> [1]. For some reason link to the thread head [2] is broken.
>
> [1]
> https://lists.qt-project.org/pipermail/development/2015-February/040199.html

Note that if anyone wants to use easier character indexing as an argument for using UTF-16 instead of UTF-8, that argument does not hold.

1. Code points may be encoded as surrogate pairs in UTF-16, e.g. this is the case for Emoji characters. QString ignores this fact, indexing 16-bit QChars. To make things worse, several QString methods like left(), right(), and mid() will happily cut a surrogate pair in half.

2. When people are talking about character indexing they often imply indexing of grapheme clusters. In the Unicode world, a grapheme cluster may be represented as several code points, depending on the normalization form of the source. To make things worse, even in NFC form not every grapheme cluster that is possible in Unicode is representable as a single code point.

> [2]
> https://lists.qt-project.org/pipermail/development/2015-February/020155.html
>
> On Tue, Jan 15, 2019 at 9:48 PM Cristian Adam wrote:
>> Hi,
>>
>> With every Qt release we see how the new release improved over previous
>> releases in terms of speed, memory consumption, etc.
>>
>> Any chance of having UTF-8 storage support for QString?
>>
>> UTF-8 is native on Linux and other *NIX platforms. Qt programs should use
>> less memory and perform better by reading fewer bytes from memory.
>>
>> Did anybody try this?
>>
>> I've heard that Qt Creator is storing source files both in UTF-8 format
>> for libclang, and in UTF-16 for its internal usage. That sounds a bit
>> wasteful.
>>
>> KDE Plasma could then better compare / compete with the other Linux desktop
>> environments which use UTF-8 for strings.
>>
>> I guess I could use CopperSpice to test this, since they added CsString
>> with both QString8 (UTF-8) and QString16 (UTF-16) supported.
>>
>> https://utf8everywhere.org/ states "UTF-16 is the worst of both worlds,
>> being both variable length and too wide"
>>
>> Cheers,
>> Cristian.

--
Regards,
Konstantin
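The two points above can be illustrated with a small sketch in standard C++, again using std::u16string as a stand-in for QString's storage. The function name countCodePoints is hypothetical. It shows that the number of code units, the number of code points, and the number of user-perceived characters can all differ for the same text.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string by treating each
// well-formed surrogate pair (high followed by low) as one code point.
// Lone surrogates are counted as one code point each.
std::size_t countCodePoints(const std::u16string &s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool nextLow = i + 1 < s.size()
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && nextLow)
            ++i; // consume the low surrogate together with the high one
        ++n;
    }
    return n;
}
```

With this helper, u"\U0001F600" is two code units but one code point, while the letter "é" is one code point when precomposed (u"\u00E9") and two when decomposed (u"e\u0301"), although a reader sees a single character in both cases. Counting grapheme clusters would need the segmentation rules that the sketch deliberately leaves out.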
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Tue, Jan 15, 2019 at 10:44:45PM +0100, Allan Sandfeld Jensen wrote:
> On Tuesday, 15 January 2019 19:43:57 CET Cristian Adam wrote:
> > Hi,
> >
> > With every Qt release we see how the new release improved over previous
> > releases in terms of speed, memory consumption, etc.
> >
> > Any chance of having UTF-8 storage support for QString?
>
> Use QByteArray when you can.

Unfortunately, quite a few APIs require QString, even if the typical use case would be completely fine with ASCII, like keys in QVariantMap or QSettings.

Andre'
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> Sent: Tuesday, January 15, 2019 at 4:44 PM
> From: "Allan Sandfeld Jensen"
> To: development@qt-project.org
> Subject: Re: [Development] Qt6: Adding UTF-8 storage support to QString
>
> On Tuesday, 15 January 2019 19:43:57 CET Cristian Adam wrote:
> > Hi,
> >
> > With every Qt release we see how the new release improved over previous
> > releases in terms of speed, memory consumption, etc.
> >
> > Any chance of having UTF-8 storage support for QString?
>
> Use QByteArray when you can.

And *I* do. (Not the OP.) But I would love a QByteArray that matches QString's API.

I wrote this email, then went about gathering evidence to make my case for why QByteArray was inadequate. It seems many of my complaints about using QByteArray over the years have been addressed unbeknownst to me, though one glaring omission remains:

- QByteArray lacks QString's arg() support. The QString("%1").arg(X) combination is pretty readable, reliable, and maintainable.

I don't know how the average Qt user stacks up, but I only use QStrings in UIs (because I have to); the rest (which is a lot) is all QByteArray. When I'm using QString outside a UI, it normally ends with toUtf8().

I don't really care about UTF-8 vs UTF-16 vs whatever (slight bias for UTF-8, as it looks better in hex editors). I just hate that I have to deal with the character-width issue this concretely, by having to code against one of two very similar but not equivalent classes.
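As a rough illustration of the kind of helper being asked for, here is a sketch of "%1" substitution on a byte string in standard C++. The function arg1 is hypothetical and is not a Qt API; it handles only the single marker "%1", whereas QString::arg() supports numbered markers and many argument types.

```cpp
#include <string>

// Hypothetical helper, not a Qt API: replaces every occurrence of the
// marker "%1" in `pattern` with `value` and returns the result.
// Limitation of this sketch: "%10" is treated as "%1" followed by '0'.
std::string arg1(std::string pattern, const std::string &value)
{
    const std::string marker = "%1";
    std::string::size_type pos = 0;
    while ((pos = pattern.find(marker, pos)) != std::string::npos) {
        pattern.replace(pos, marker.size(), value);
        pos += value.size(); // skip the inserted text so it is not rescanned
    }
    return pattern;
}
```

A call such as arg1("GET %1 HTTP/1.1", "/index.html") builds a request line without leaving the byte-string world, which is the readability the message attributes to QString("%1").arg(X).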
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Wednesday, 16 January 2019 10:44:27 PST Konstantin Tokarev wrote:
> From the QtWebKit perspective, it would be great if Qt APIs which require QString
> now would also accept QLatin1String, at least for ASCII-only data.

Which ones? Currently, the only thing that takes QLatin1String in the API is QString itself. Where would you like to see more QLatin1String API?

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
Re: [Development] Qt6: Adding UTF-8 storage support to QString
15.01.2019, 21:45, "Cristian Adam":
> Hi,
>
> With every Qt release we see how the new release improved over previous
> releases in terms of speed, memory consumption, etc.
>
> Any chance of having UTF-8 storage support for QString?
>
> UTF-8 is native on Linux and other *NIX platforms. Qt programs should use
> less memory and perform better by reading fewer bytes from memory.
>
> Did anybody try this?
>
> I've heard that Qt Creator is storing source files both in UTF-8 format for
> libclang, and in UTF-16 for its internal usage. That sounds a bit wasteful.
>
> KDE Plasma could then better compare / compete with the other Linux desktop
> environments which use UTF-8 for strings.
>
> I guess I could use CopperSpice to test this, since they added CsString with
> both QString8 (UTF-8) and QString16 (UTF-16) supported.
>
> https://utf8everywhere.org/ states "UTF-16 is the worst of both worlds, being
> both variable length and too wide"

From the QtWebKit perspective, it would be great if Qt APIs which require QString now would also accept QLatin1String, at least for ASCII-only data.

> Cheers,
> Cristian.

--
Regards,
Konstantin
Re: [Development] Qt6: Adding UTF-8 storage support to QString
16.01.2019, 00:46, "Allan Sandfeld Jensen":
> On Tuesday, 15 January 2019 19:43:57 CET Cristian Adam wrote:
>> Hi,
>>
>> With every Qt release we see how the new release improved over previous
>> releases in terms of speed, memory consumption, etc.
>>
>> Any chance of having UTF-8 storage support for QString?
>
> Use QByteArray when you can.

The problem is that many Qt APIs require QString.

> Regards
> 'Allan

--
Regards,
Konstantin
Re: [Development] Qt6: Adding UTF-8 storage support to QString
Marco Bubke (16 January 2019 10:59) reported:
>> https://utf8everywhere.org/ states "UTF-16 is the worst of both
>> worlds, being both variable length and too wide"

Konstantin Ritt (16 January 2019 17:50) replied:
> https://utf8everywhere.org/ states bullshit. Try reading alternative
> sources.

At the very least, one might guess from the site name that it's possible the site does not "speak in a neutral voice" upon the subject; it has a clear bias in favour of UTF-8.

Eddy.
Re: [Development] Qt6: Adding UTF-8 storage support to QString
> https://utf8everywhere.org/ states "UTF-16 is the worst of both worlds,
> being both variable length and too wide"

https://utf8everywhere.org/ states bullshit. Try reading alternative sources.

Regards,
Konstantin

Wed, 16 Jan 2019 at 13:20, Edward Welbourne:
> Marco Bubke (16 January 2019 10:59)
> > You can use std::string, which has small string optimization, instead of
> > QByteArray too. In many cases where you would use const String &
> > you can use std::string_view, so you are more flexible.
>
> Note that we now have a QStringView, which can likewise replace many
> uses of const QString & - not that this is any help with UTF-8.
> Uptake has been slow, but some of us are using it.
>
> Eddy.
Re: [Development] Qt6: Adding UTF-8 storage support to QString
Marco Bubke (16 January 2019 10:59)
> You can use std::string, which has small string optimization, instead of
> QByteArray too. In many cases where you would use const String &
> you can use std::string_view, so you are more flexible.

Note that we now have a QStringView, which can likewise replace many uses of const QString & - not that this is any help with UTF-8. Uptake has been slow, but some of us are using it.

Eddy.
Re: [Development] Qt6: Adding UTF-8 storage support to QString
You can use std::string, which has small string optimization, instead of QByteArray too. In many cases where you would use const String & you can use std::string_view, so you are more flexible.

From: Development on behalf of Allan Sandfeld Jensen
Sent: Tuesday, January 15, 2019 10:44:45 PM
To: development@qt-project.org
Subject: Re: [Development] Qt6: Adding UTF-8 storage support to QString

On Tuesday, 15 January 2019 19:43:57 CET Cristian Adam wrote:
> Hi,
>
> With every Qt release we see how the new release improved over previous
> releases in terms of speed, memory consumption, etc.
>
> Any chance of having UTF-8 storage support for QString?

Use QByteArray when you can.

Regards
'Allan
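The flexibility mentioned above can be sketched in a few lines of C++17. The function countDigits is a hypothetical example, not taken from the thread. Because it takes a std::string_view by value, it accepts a std::string, a string literal, or a slice of a larger buffer without copying the characters.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical example function: counts ASCII digits in `text`.
// Taking std::string_view instead of const std::string & means callers
// holding a literal or a substring do not have to build a std::string first.
std::size_t countDigits(std::string_view text)
{
    std::size_t n = 0;
    for (char c : text) {
        if (c >= '0' && c <= '9')
            ++n;
    }
    return n;
}
```

The same function can be called as countDigits("Qt 6"), as countDigits(someStdString), or on std::string_view(someStdString).substr(0, 6); none of these calls allocates. QStringView plays the corresponding role for UTF-16 data in Qt.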
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Tuesday, 15 January 2019 19:43:57 CET Cristian Adam wrote:
> Hi,
>
> With every Qt release we see how the new release improved over previous
> releases in terms of speed, memory consumption, etc.
>
> Any chance of having UTF-8 storage support for QString?

Use QByteArray when you can.

Regards
'Allan
Re: [Development] Qt6: Adding UTF-8 storage support to QString
Cristian,

the previous discussion is "Why can't QString use UTF-8 internally?" There is something wrong with our mailing list; the best link I found is [1]. For some reason the link to the thread head [2] is broken.

[1] https://lists.qt-project.org/pipermail/development/2015-February/040199.html
[2] https://lists.qt-project.org/pipermail/development/2015-February/020155.html

On Tue, Jan 15, 2019 at 9:48 PM Cristian Adam wrote:
>
> Hi,
>
> With every Qt release we see how the new release improved over previous
> releases in terms of speed, memory consumption, etc.
>
> Any chance of having UTF-8 storage support for QString?
>
> UTF-8 is native on Linux and other *NIX platforms. Qt programs should use
> less memory and perform better by reading fewer bytes from memory.
>
> Did anybody try this?
>
> I've heard that Qt Creator is storing source files both in UTF-8 format for
> libclang, and in UTF-16 for its internal usage. That sounds a bit wasteful.
>
> KDE Plasma could then better compare / compete with the other Linux desktop
> environments which use UTF-8 for strings.
>
> I guess I could use CopperSpice to test this, since they added CsString with
> both QString8 (UTF-8) and QString16 (UTF-16) supported.
>
> https://utf8everywhere.org/ states "UTF-16 is the worst of both worlds, being
> both variable length and too wide"
>
> Cheers,
> Cristian.
Re: [Development] Qt6: Adding UTF-8 storage support to QString
On Tuesday, 15 January 2019 10:43:57 PST Cristian Adam wrote:
> Any chance of having UTF-8 storage support for QString?

No.

--
Thiago Macieira - thiago.macieira (AT) intel.com
Software Architect - Intel Open Source Technology Center
[Development] Qt6: Adding UTF-8 storage support to QString
Hi,

With every Qt release we see how the new release improved over previous releases in terms of speed, memory consumption, etc.

Any chance of having UTF-8 storage support for QString?

UTF-8 is native on Linux and other *NIX platforms. Qt programs should use less memory and perform better by reading fewer bytes from memory.

Did anybody try this?

I've heard that Qt Creator is storing source files both in UTF-8 format for libclang, and in UTF-16 for its internal usage. That sounds a bit wasteful.

KDE Plasma could then better compare / compete with the other Linux desktop environments which use UTF-8 for strings.

I guess I could use CopperSpice to test this, since they added CsString with both QString8 (UTF-8) and QString16 (UTF-16) supported.

https://utf8everywhere.org/ states "UTF-16 is the worst of both worlds, being both variable length and too wide"

Cheers,
Cristian.
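The memory claim above depends on the text being stored. A small sketch in standard C++ makes the comparison explicit; std::string holds UTF-8 bytes and std::u16string holds UTF-16 code units, both as stand-ins for the Qt types under discussion. The function name payloadBytes is hypothetical, and the figures cover only the character payload, not any header, terminator, or allocator overhead.

```cpp
#include <cstddef>
#include <string>

// Bytes used by the character data of a UTF-8 encoded string.
std::size_t payloadBytes(const std::string &utf8)
{
    return utf8.size() * sizeof(char);
}

// Bytes used by the character data of a UTF-16 encoded string.
std::size_t payloadBytes(const std::u16string &utf16)
{
    return utf16.size() * sizeof(char16_t);
}
```

On typical platforms, where char16_t is two bytes, the ASCII text "QString" takes 7 bytes as UTF-8 and 14 bytes as UTF-16. The single CJK character U+4E2D goes the other way: three bytes as UTF-8 ("\xE4\xB8\xAD") and two bytes as UTF-16. UTF-8 storage would therefore save memory for mostly-ASCII data, and could cost more for text dominated by characters in that range.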