Re: [Development] QUtf8String{, View}
On 25/05/20 17:40, Thiago Macieira wrote:
> On Monday, 25 May 2020 04:37:26 PDT Edward Welbourne wrote:
>> The "comparisons" heading might stretch as far as using a UTF-8 key to
>> do a look-up in a QString-keyed hash,
>
> Using UTF-8 data to look up in a QString-keyed hash will require conversion
> to UTF-16 to calculate the hash. It can't be calculated on-the-fly.

Being a bit creative, one could use an unordered_map with a custom transparent hasher that hashes the (first N) code points of the key... (Requires C++20)

My 2 c,
-- 
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
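The transparent-hasher idea can be sketched in plain standard C++. This is only an illustration under stated assumptions: std::u16string stands in for QString, std::string_view for the UTF-8 key, the input is assumed to be valid UTF-8/UTF-16, and every name below (nextCodePoint, CodePointHash, CodePointEqual, lookupUtf8) is hypothetical rather than Qt API. The hash runs over decoded code points, so a UTF-8 key and a UTF-16 key with the same text hash identically. Where the library lacks C++20 heterogeneous lookup, the sketch falls back to transcoding first, which is the cost Thiago describes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>

// Decode one code point from UTF-8 and advance i. Assumes valid input.
inline char32_t nextCodePoint(std::string_view s, std::size_t &i)
{
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80)
        return lead;
    const int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);
    for (int k = 0; k < extra; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Decode one code point from UTF-16 and advance i. Assumes valid input.
inline char32_t nextCodePoint(std::u16string_view s, std::size_t &i)
{
    const char32_t hi = s[i++];
    if (hi < 0xD800 || hi > 0xDBFF)
        return hi;
    const char32_t lo = s[i++];
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}

// FNV-1a over code points, so the result does not depend on the encoding.
struct CodePointHash {
    using is_transparent = void;
    template <typename View>
    static std::size_t hashView(View s)
    {
        std::uint64_t h = 14695981039346656037ull;
        for (std::size_t i = 0; i < s.size();) {
            h ^= nextCodePoint(s, i);
            h *= 1099511628211ull;
        }
        return static_cast<std::size_t>(h);
    }
    std::size_t operator()(std::u16string_view s) const { return hashView(s); }
    std::size_t operator()(std::string_view s) const { return hashView(s); }
};

struct CodePointEqual {
    using is_transparent = void;
    bool operator()(std::u16string_view a, std::u16string_view b) const { return a == b; }
    bool operator()(std::string_view a, std::u16string_view b) const
    {
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size())
            if (nextCodePoint(a, i) != nextCodePoint(b, j))
                return false;
        return i == a.size() && j == b.size();
    }
    bool operator()(std::u16string_view a, std::string_view b) const { return (*this)(b, a); }
};

using Table = std::unordered_map<std::u16string, int, CodePointHash, CodePointEqual>;

// Transcoding helper, only needed when heterogeneous lookup is unavailable.
inline std::u16string toUtf16(std::string_view utf8)
{
    std::u16string out;
    for (std::size_t i = 0; i < utf8.size();) {
        const char32_t cp = nextCodePoint(utf8, i);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            out.push_back(static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF)));
        }
    }
    return out;
}

// Look up a UTF-8 key in a UTF-16-keyed table; nullptr when absent.
inline const int *lookupUtf8(const Table &table, std::string_view utf8Key)
{
#if defined(__cpp_lib_generic_unordered_lookup)
    const auto it = table.find(utf8Key);          // C++20: no temporary key
#else
    const auto it = table.find(toUtf16(utf8Key)); // older libraries: transcode
#endif
    return it == table.end() ? nullptr : &it->second;
}

inline void demo()
{
    Table table;
    table.emplace(u"caf\u00E9", 1);
    assert(lookupUtf8(table, "caf\xC3\xA9") != nullptr);
    assert(lookupUtf8(table, "cafe") == nullptr);
}
```

Hashing every code point still decodes the whole key, so any saving comes from avoiding the temporary allocation, not from skipping the decode; hashing only the first N code points, as suggested above, would bound that work at the price of more collisions.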
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Monday, 25 May 2020 04:37:26 PDT Edward Welbourne wrote: > The "comparisons" heading might stretch as far as using a UTF-8 key to > do a look-up in a QString-keyed hash, Using UTF-8 data to look up in a QString-keyed hash will require conversion to UTF-16 to calculate the hash. It can't be calculated on-the-fly. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] QUtf8String{, View}
On 25/05/2020 07.37, Edward Welbourne wrote: I would just call it QUtf8View, since (see below) I don't see value in a separate QUtf8String for it to be a view into On the one hand... std::string_view is not a view into a std::string. A std::string is a *container* for text, a std::string_view is a *view* for text. They both have 'string' in their name because they both deal with text, not because a std::string_view is a view of a std::string. Similarly, a QStringView may or may not be a view of a QString. Thus, it does not follow that having a QUtf8StringView in any way implies relation to, or existence of, a QUtf8String. On the other hand, "Utf8String" is arguably redundant. But so is "Latin1String", which we already have. I think, for the sake of existing precedent, QUtf8StringView is the correct name. If you are under the (mistaken) impression that an XStringView implies being a view of an XString, well, sorry, but that's just not the case, for any value of 'X' ('std::', 'Q', 'QUtf8', ...). On a different note, if we *had* QUtf8String and something like QAnyString, it might help with a migration path by which we eventually rename QString to QUtf16String (likely with an alias initially) and eventually make QString an alias for QUtf8String instead. -- Matthew ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
Thiago Macieira (23 May 2020 03:06) wrote: > Update: > > As we're reviewing the changes Lars is making to get rid of > QStringRef, Lars, Marc and I came to the conclusion that > QUtf8StringView is required for Qt 6.0. [snip] Sounds sensible. I would just call it QUtf8View, since (see below) I don't see value in a separate QUtf8String for it to be a view into, so making clear that it's a view, not backed by any particular string type, has value; but the detail of naming is less important. > There are currently no conclusions on QUtf8String and QAnyString, nor > on what the APIs should look like. I don't really see the need for an owning 8-bit string type (hence, equally, for QAnyString); we have QByteArray to serve as data-owner behind a UTF-8 view, when the data's not a C-string literal but is known to be UTF-8, and the simplicity of "when we store bytes with the semantics of text, we always do so in UTF-16" argues against doing anything more with UTF-8 views than supporting comparisons (including starts-with, ends-with, contains, index-of) and constructing a QString out of one. The "comparisons" heading might stretch as far as using a UTF-8 key to do a look-up in a QString-keyed hash, if doing so does actually bring a meaningful saving compared to converting to UTF-16 first; which, of course, might resurface in various other query APIs (asking for an HTTP header's value from an object packaging a map, or an HTML tag's attribute value). There are perhaps other places where it'll make sense for APIs taking a QStringView to also have a QUtf8View overload; but, crucially, by limiting UTF-8 to view-level support, we provide a bound on how widely it makes sense to burden our APIs with more overloads than just QString and/or up to two views. Eddy. ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] QUtf8String{, View}
On Saturday, 23 May 2020 05:39:37 PDT Giuseppe D'Angelo via Development wrote: > To elaborate on this: does operator==(QStringView, char*) already exist > (maybe under QT_NO_CAST...)? If yes, isn't that char* already assumed to > be UTF-8? Do you want a QUtf8StringView to cleanly compile also under > QT_NO_CAST_FROM_ASCII (and obviously use UTF-8, not Latin1), to reap > compile-time strlen, etc? All of the above. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] QUtf8String{, View}
On 23/05/20 03:06, Thiago Macieira wrote:
> As we're reviewing the changes Lars is making to get rid of QStringRef,
> Lars, Marc and I came to the conclusion that QUtf8StringView is required
> for Qt 6.0.
>
> That's because some methods that previously returned QStringRef now return
> QStringView and to retain compatibility with:
>
>     if (xml.attribute("foo") == "bar")
>
> where QXmlStreamReader::attribute() returns QStringView, we really need to
> capture that "bar" as a UTF-8 string and we ought to have optimised UTF-16
> to UTF-8 comparisons. So we're working on it.

To elaborate on this: does operator==(QStringView, char*) already exist (maybe under QT_NO_CAST...)? If yes, isn't that char* already assumed to be UTF-8? Do you want a QUtf8StringView to cleanly compile also under QT_NO_CAST_FROM_ASCII (and obviously use UTF-8, not Latin1), to reap compile-time strlen, etc?

Thanks,
-- 
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
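The "compile-time strlen" point can be illustrated with a tiny view type. This is a sketch only: Utf8View here is a hypothetical stand-in, not the QUtf8StringView that was being designed, and it assumes the argument is a string literal with no embedded NUL characters.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical minimal UTF-8 view. The constructor template deduces the
// array size N of a string literal, so no strlen() call happens at run time.
struct Utf8View {
    const char *data;
    std::size_t size;

    template <std::size_t N>
    constexpr Utf8View(const char (&literal)[N])
        : data(literal), size(N - 1) // N counts the terminating NUL
    {
    }
};

// The length is a constant expression.
static_assert(Utf8View("bar").size == 3, "length is known at compile time");

inline void demo()
{
    // The size counts UTF-8 code units (bytes), not characters:
    // "caf" is 3 bytes and U+00E9 is encoded as 2 bytes.
    assert(Utf8View("caf\xC3\xA9").size == 5);
}
```

A comparison operator taking such a view could then receive "bar" together with its length, which a plain char* overload cannot do without calling strlen().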
Re: [Development] QUtf8String{, View}
On 23.05.20 at 03:06, Thiago Macieira wrote:
> On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
>> There's only our own laziness which stands in the way of this better
>> alternative.
>
> [snip the rest]
>
> Update:
>
> As we're reviewing the changes Lars is making to get rid of QStringRef,
> Lars, Marc and I came to the conclusion that QUtf8StringView is required
> for Qt 6.0.
>
> That's because some methods that previously returned QStringRef now return
> QStringView and to retain compatibility with:
>
>     if (xml.attribute("foo") == "bar")
>
> where QXmlStreamReader::attribute() returns QStringView, we really need to
> capture that "bar" as a UTF-8 string and we ought to have optimised UTF-16
> to UTF-8 comparisons. So we're working on it.
>
> If it had been wrapped in QLatin1String(), there would be no compatibility
> issues, as there already is an operator==() for QStringView/QLatin1String.
>
> There are currently no conclusions on QUtf8String and QAnyString, nor on
> what the APIs should look like.

This also needs a solution in the other direction, QXmlStreamWriter: it is painfully slow, despite UTF-8 XML, UTF-8 source code (element names, attribute names) and data which might be used as - or transformed to - UTF-8 directly. Everything needs to go to UTF-16 first (QString), and then back to UTF-8. Allocations, reference counting, no SSO...
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote: > There's only our own lazyness which stands in the way of this better > alternative. [snip the rest] Update: As we're reviewing the changes Lars is making to get rid of QStringRef, Lars, Marc and I came to the conclusion that QUtf8StringView is required for Qt 6.0. That's because some methods that previously returned QStringRef now return QStringView and to retain compatibility with: if (xml.attribute("foo") == "bar") where QXmlStreamReader::attribute() returns QStringView, we really need to capture that "bar" as a UTF-8 string and we ought to have optimised UTF-16 to UTF-8 comparisons. So we're working on it. If it had been wrapped in QLatin1String(), there would be no compatibility issues, as there already is an operator==() for QStringView/QLatin1String. There are currently no conclusions on QUtf8String and QAnyString, nor on what the APIs should look like. -- Thiago Macieira - thiago.macieira (AT) intel.com Software Architect - Intel System Software Products ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] QUtf8String{, View}
> QUtf8StringIterator can be easily added to extract Unicode codepoints from
> the UTF-8 string like QStringIterator exists for the same in UTF-16.

I discovered QStringIterator through the KDAB blog; it is nice, thanks! A QUtf8StringIterator would boost Qt's UTF-8 support, for example by making it easy to split or filter a QByteArray based on QChar category. I agree though that it is not really useful for IO/networking code and much more on QString.
Re: [Development] QUtf8String{, View}
On 5/16/20 6:16 PM, Thiago Macieira wrote:
> That opens a philosophical question. In:
>
>     QString s = u"a a\u0301"; // U+0301 COMBINING ACUTE ACCENT
>     s.replace('a', 'b');
>
> Should we now have a b with accent? (b́)

It's not philosophical at all, it's a defining question: at which level does QString operate? It does not operate at the EGC (extended grapheme cluster) level, it operates at the UTF-16 level. (Proof: s.size() above is 4). Hence, the replace() above is merely replacing 0x0061 with 0x0062 in the char16_t-like storage.

My 2 c,
-- 
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
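The same behaviour can be reproduced without Qt. The sketch below uses std::u16string as a stand-in for QString's char16_t-like storage; replaceCodeUnit is a hypothetical helper, not a Qt function.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Replace every occurrence of one UTF-16 code unit with another. This works
// purely on the char16_t storage and knows nothing about combining marks or
// grapheme clusters.
inline std::u16string replaceCodeUnit(std::u16string s, char16_t from, char16_t to)
{
    std::replace(s.begin(), s.end(), from, to);
    return s;
}

inline void demo()
{
    const std::u16string s = u"a a\u0301"; // 'a', ' ', 'a', U+0301
    assert(s.size() == 4);                 // four code units, three visible "characters"
    // Both 0x0061 units become 0x0062; U+0301 stays and now decorates the 'b'.
    assert(replaceCodeUnit(s, u'a', u'b') == u"b b\u0301");
}
```

An implementation working on grapheme clusters would have to treat "a" followed by U+0301 as one unit and leave it alone, which is a different (and much more expensive) operation.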
Re: [Development] QUtf8String{, View}
On 15/05/2020 14.31, Thiago Macieira wrote:
> Python's bstr still behaves string-like and has methods like
> QByteArray::indexOf(const char *). QVector has no such methods.
>
> But since we do have QListSpecialMethods, we can inject them into
> QVector too.

Right; I was assuming that would happen for e.g. decoding. I think 'block of memory' type functions also make sense, for example, startsWith. Byte arrays are unique in that *sequences* of elements often have meaning, which is not often the case for other sorts of arrays. It's quite reasonable to ask if a byte array starts with 0xDE, 0xAD, 0xBE, 0xEF (which is not text!) or to search for such sequences.

Slicing operations (left, mid, right) are also useful, though those are probably useful for *any* array.

-- 
Matthew
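As a rough illustration of such 'block of memory' helpers, here is a sketch on top of std::vector. Bytes, startsWith and indexOf are illustrative free functions written for this example; they are not the QByteArray or QVector API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// True if data begins with the given byte sequence.
inline bool startsWith(const Bytes &data, const Bytes &prefix)
{
    return prefix.size() <= data.size()
        && std::equal(prefix.begin(), prefix.end(), data.begin());
}

// Index of the first occurrence of needle in data, or -1 if absent.
inline std::ptrdiff_t indexOf(const Bytes &data, const Bytes &needle)
{
    if (needle.empty())
        return 0;
    const auto it = std::search(data.begin(), data.end(), needle.begin(), needle.end());
    return it == data.end() ? -1 : std::distance(data.begin(), it);
}

inline void demo()
{
    const Bytes data{0xDE, 0xAD, 0xBE, 0xEF, 0x00, 0x01};
    assert(startsWith(data, {0xDE, 0xAD, 0xBE, 0xEF})); // a magic number, not text
    assert(indexOf(data, {0xBE, 0xEF}) == 2);
}
```

Neither function makes any assumption about encoding, which is the distinction being drawn above: sequence operations are meaningful for raw bytes, case conversion is not.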
Re: [Development] QUtf8String{, View}
I face a discussion like this every couple of months ;)

Yes, we should have a b with accent, because it is exactly what the programmer asked QString for; it is not our fault if b with accent is not what he meant to get after replace. QString (like any other tool) must not be used blindly or with zero knowledge about what it operates on and what the expected result is. s.length() is not the number of characters in the string; s.at(i) does not return the i'th character in the string; and surely QUtf(8|32)String.replace('a', 'b') won't behave any "better" (I mean there won't be any magic behind the scenes that would save some idiot from writing some dumb code).

Regards,
Konstantin

On Sat, 16 May 2020 at 19:17, Thiago Macieira wrote:
> On Saturday, 16 May 2020 08:52:19 PDT Arnaud Clère wrote:
> > Regarding the relevance of a QUtf8String, I feel like it would not be so
> > useful unless it allows to view its content as QChar instead of char (or
> > char8_t) since handling multibyte characters is so error prone. At least a
> > QChar handles most unicode characters as single entities...
>
> QUtf8StringIterator can be easily added to extract Unicode codepoints from
> the UTF-8 string like QStringIterator exists for the same in UTF-16.
>
> Usually, though, this means you're doing something wrong. Grapheme clusters
> can span multiple codepoints. Unless you're doing text shaping, you probably
> don't need them. And if you don't need them, why do you care about codepoints
> in the first place?
>
> That opens a philosophical question. In:
>
>     QString s = u"a a\u0301"; // U+0301 COMBINING ACUTE ACCENT
>     s.replace('a', 'b');
>
> Should we now have a b with accent? (b́)
>
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
>   Software Architect - Intel System Software Products
Re: [Development] QUtf8String{, View}
On Saturday, 16 May 2020 08:52:19 PDT Arnaud Clère wrote:
> Regarding the relevance of a QUtf8String, I feel like it would not be so
> useful unless it allows to view its content as QChar instead of char (or
> char8_t) since handling multibyte characters is so error prone. At least a
> QChar handles most unicode characters as single entities...

QUtf8StringIterator can be easily added to extract Unicode codepoints from the UTF-8 string like QStringIterator exists for the same in UTF-16.

Usually, though, this means you're doing something wrong. Grapheme clusters can span multiple codepoints. Unless you're doing text shaping, you probably don't need them. And if you don't need them, why do you care about codepoints in the first place?

That opens a philosophical question. In:

    QString s = u"a a\u0301"; // U+0301 COMBINING ACUTE ACCENT
    s.replace('a', 'b');

Should we now have a b with accent? (b́)

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] QUtf8String{, View}
On 16/05/20 17:52, Arnaud Clère wrote:
> Regarding the relevance of a QUtf8String, I feel like it would not be so
> useful unless it allows to view its content as QChar instead of char (or
> char8_t) since handling multibyte characters is so error prone. At least a
> QChar handles most unicode characters as single entities...

=> QStringIterator.

Cheers,
-- 
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
Re: [Development] QUtf8String{, View}
Hi all,

As a user, I am happy with simplifications regarding string handling and I think the choice of UTF-16 for QString makes sense for most code except for file/networking code.

I was once concerned about not being able to distinguish between a QByteArray containing UTF-8 text and a QByteArray containing other data. But now, I am using the following pattern to distinguish between both in the *very rare places* where I need to have a different overload for each, or want to provide a separate type with strong guarantees about a QByteArray's content:

    struct Utf8 { QByteArray utf8; };

This pattern obviously works for "tagging" any kind of data. See https://www.fluentcpp.com/2018/04/06/strong-types-by-struct/ for a presentation of this pattern.

Regarding the relevance of a QUtf8String, I feel like it would not be so useful unless it allows viewing its content as QChar instead of char (or char8_t), since handling multibyte characters is so error-prone. At least a QChar handles most Unicode characters as single entities...

Cheers,
Arnaud
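A minimal sketch of how such a tag type selects an overload, with std::string standing in for QByteArray so the example compiles without Qt; the describe() functions are hypothetical and exist only to show which overload is chosen.

```cpp
#include <cassert>
#include <string>

// Strong type by struct: the wrapper carries no behaviour, it only records
// the promise "these bytes are UTF-8 text" in the type system.
struct Utf8 { std::string utf8; };

// Overload for untagged data of unknown (or no) encoding.
inline std::string describe(const std::string &) { return "raw bytes"; }

// Overload for data the caller has explicitly tagged as UTF-8.
inline std::string describe(const Utf8 &) { return "UTF-8 text"; }

inline void demo()
{
    assert(describe(std::string("\xDE\xAD")) == "raw bytes");
    assert(describe(Utf8{"caf\xC3\xA9"}) == "UTF-8 text");
}
```

Because Utf8 is an aggregate with no converting constructor, a plain byte string never silently becomes tagged text; the caller has to write Utf8{...} at the point where the guarantee is made.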
Re: [Development] QUtf8String{, View}
On Friday, 15 May 2020 10:52:49 PDT Matthew Woehlke wrote:
> > Like that, it's just "array of bytes of an arbitrary encoding (or none)".
> > There's still a reason to have QByteArray and it'll need to exist in
> > networking and file I/O code. That means the string classes, if any, need
> > to be convertible to QByteArray anyway.
>
> I think we can learn from Python 3 here... QByteArray should go the way
> of QStringList, i.e. yes, it *should* be a QVector. Like
> QVector, it might (should) have additional methods, such as
> explicit conversion to/from QString (a la Python's encode/decode), but
> it should *not* have string-like manipulation (e.g. toUpper).

Those are all Qt 7 work. We can deprecate those methods in Qt 6, but not remove them in 6.0.

Python's bstr still behaves string-like and has methods like QByteArray::indexOf(const char *). QVector has no such methods.

But since we do have QListSpecialMethods, we can inject them into QVector too.

> >> So, assuming the premise that QByteArray should not be string-ish
> >> anymore, what do we want to have as the result type of QString::toUtf8()
> >> and QString::toLatin1()? Do we really want mere bytes?
>
> Yes. Maybe. Again, this is how Python 3 works.
>
> It might make sense to have a QUtf8String class, but that should be
> distinct from, and not implicitly constructible from, QByteArray a.k.a.
> QVector. (Implicit conversion *to* QByteArray might be okay.)
>
> (BTW, is 'byte' QByte or std::byte? Can we possibly achieve the latter?)

There's no QByte and we shouldn't have that type. std::byte is an enum around the actual byte type (unsigned char) and char is also a byte.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] QUtf8String{, View}
On 14/05/2020 21.12, Thiago Macieira wrote: On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote: Also, given a function like setFoo(const QByteArray &); what does this actually expect? An UTF-8 string? A local 8-bit string? An octet stream? A Latin-1 string? QByteArray is the jack of all these, master of none. Like that, it's just "array of bytes of an arbitrary encoding (or none)". There's still a reason to have QByteArray and it'll need to exist in networking and file I/O code. That means the string classes, if any, need to be convertible to QByteArray anyway. I think we can learn from Python 3 here... QByteArray should go the way of QStringList, i.e. yes, it *should* be a QVector. Like QVector, it might (should) have additional methods, such as explicit conversion to/from QString (a la Python's encode/decode), but it should *not* have string-like manipulation (e.g. toUpper). So, assuming the premise that QByteArray should not be string-ish anymore, what do we want to have as the result type of QString::toUtf8() and QString::toLatin1()? Do we really want mere bytes? Yes. Maybe. Again, this is how Python 3 works. It might make sense to have a QUtf8String class, but that should be distinct from, and not implicitly constructible from, QByteArray a.k.a. QVector. (Implicit conversion *to* QByteArray might be okay.) (BTW, is 'byte' QByte or std::byte? Can we possibly achieve the latter?) -- Matthew ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Friday, 15 May 2020 03:33:28 PDT Lars Knoll wrote:
> Pretty much all uses of QL1String that I’ve seen are about ASCII only
> content. That is certainly true for Qt itself, but also to a large degree
> for our users. For those, utf-8 conversions are within 5% of latin1
> decoding. This makes it very clear to me that we should *not* have any
> special handling for ascii that require a separate API.

We don't want Latin1 content in our files.

There are two reasons for having QLatin1String and not QAsciiString:

1) historical. It was added in 4.0 (2005), when a good fraction of people were still running 8-bit Latin1 or Latin9 as their locales. It was actually added as a replacement for people writing macros like this in 3.x times:

    #define L1S(x) QString::fromLatin1(x)

Additionally, we mis-purposed the name "Ascii" in Qt to mean "locale-encoded strings".

2) the Latin1 codec is FAST, but only because it needs to do no error checking. If we had a QAsciiString class or proper US-ASCII conversion functions, we'd get bug reports that something with a high bit set was not flagged and replaced with U+FFFD Replacement Character when converted. This error checking is similar to what the UTF-8 decoder does, so for US-ASCII content a checked ASCII decoder would be no faster than the UTF-8 decoder.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
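The difference between the two decoders can be sketched as scalar loops. This is an illustration only: fromLatin1 and fromAscii are free functions written for this example (the real Qt codecs are vectorised), and std::u16string stands in for QString.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Latin-1: every byte 0x00..0xFF maps to the code point with the same value,
// so the loop is a plain widening copy with no checks at all.
inline std::u16string fromLatin1(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());
    for (unsigned char c : in)
        out.push_back(c);
    return out;
}

// Strict US-ASCII: bytes with the high bit set are invalid and have to be
// detected and replaced with U+FFFD. This per-byte test is the same kind of
// check the UTF-8 decoder's ASCII fast path performs.
inline std::u16string fromAscii(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());
    for (unsigned char c : in)
        out.push_back(c < 0x80 ? char16_t(c) : char16_t(0xFFFD));
    return out;
}

inline void demo()
{
    assert(fromLatin1("caf\xE9") == u"caf\u00E9"); // 0xE9 is a valid Latin-1 e-acute
    assert(fromAscii("caf\xE9") == u"caf\uFFFD");  // 0xE9 is not ASCII: replaced
}
```

For pure ASCII input both functions produce the same result; the only difference is the high-bit test, which is why a checked ASCII decoder gains nothing over decoding the same bytes as UTF-8.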
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
> On 15 May 2020, at 03:12, Thiago Macieira wrote: > > On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote: >> Also, given a function like >> >>setFoo(const QByteArray &); >> >> what does this actually expect? An UTF-8 string? A local 8-bit string? >> An octet stream? A Latin-1 string? QByteArray is the jack of all these, >> master of none. What I would like to do right now for 6.0 is that all 8bit encoded text is assumed to be UTF-8. Simple as that. If it’s something else, the developer will have to take care of it himself. This is an important point for Qt 6.0 and independent of and QUtf8String we might or might not add later on. > > Like that, it's just "array of bytes of an arbitrary encoding (or none)". > There's still a reason to have QByteArray and it'll need to exist in > networking and file I/O code. That means the string classes, if any, need to > be convertible to QByteArray anyway. Agreed. > >> So, assuming the premiss that QByteArray should not be string-ish >> anymore, what do we want to have as the result type of QString::toUtf8() >> and QString::toLatin1()? Do we really want mere bytes? >> >> I don't think so. > > Since for Qt, String = UTF-16, then anything in another encoding is "a bag of > bytes". QByteArray does serve that purpose. > >> If Unicode succeeds, most I/O will be in the form of UTF-8. File names >> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16 >> (as they are on Windows). It makes a _ton_ of sense to have a container >> for this, and C++20 tempts us with char8_t to do exactly that. I'd love >> to do string processing in UTF-8 without potentially doubling the >> storage requirements by first converting it to UTF-16, then doing the >> processing, then converting it back. What are we actually gaining by having another string class? Yes, UTF-8 is being used in many places. 
But are the gains of directly working on UTF-8 enough to justify the duplication of all our string related APIs and implementations? > > Unless you're processing Cyrillic or Greek text, in which case your memory > usage will be about the same. Or if you're processing CJK, in which case > UTF-16 is a 33% reduction in memory use. Correct. Utf-8 only saves space for content that is mostly ascii. But if you only need ascii text processing, you can just as well do it on the current QByteArray. > >> Qt should have a strong story not just for UTF-16, but also for UTF-8. > > So long as it's not confusing on which class to use, sure. If that means a > proliferation of overloads everywhere, we've gone wrong somewhere. +1. Almost all other programming languages out there have standardised on one class for unicode string/text handling. IMO this is the correct approach. The fact that we’re using UTF-16 is historical, but it’s not better or worse than UTF-8. Let’s make transcoding fast, and stop worrying about several encodings. > >> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one, I’ll veto any UTF-32 string class. There is simply not a single good reason for using such a class. The only ‘advantage’ it has is one unicode code point per index, but that doesn’t help as unicode text processing anyways needs to look beyond that (at e.g. grapheme clusters etc). And it wastes lots of memory. >> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16 >> operations are not much slower than L1 <-> utf16 ones (I heard Lars' >> team has them down to within 5% of each other, not sure that's >> possible). > > The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is > within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The > difference in performance is the need to check if the high bit is set. Both > codecs are vectorised with both SSE2 and AVX2 implementations. 
There are also > Neon implementations, but I don't know their benchmark numbers (note: the > UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit). > > For non-US-ASCII Latin1 text, the performance is more than 5% worse, > depending > on how dense the non-ASCII characters are in the string. But given that we > want our files to be encoded in UTF-8 anyway, decoding of non-ASCII Latin1 > should be rare. > > I also have an implementation of UTF-16 to ASCII codec, which is the same as > UTF-16 to Latin1, but without error checking. That requires that the string > class store whether it contains only US-ASCII. I've never pushed this to Qt. Pretty much all uses of QL1String that I’ve seen are about ASCII only content. That is certainly true for Qt itself, but also to a large degree for our users. For those, utf-8 conversions are within 5% of latin1 decoding. This makes it very clear to me that we should *not* have any special handling for ascii that require a separate API. Conversion speed for non ascii content is something we can improve, there are various BSD licensed
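The memory figures quoted in this exchange (UTF-8 smaller for ASCII, about equal for Cyrillic or Greek, UTF-16 a third smaller for CJK) can be checked with a little arithmetic. A sketch, with hypothetical helper names, that counts encoded bytes per code point:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed to encode the text as UTF-8: 1 to 4 bytes per code point.
inline std::size_t utf8Bytes(const std::u32string &text)
{
    std::size_t n = 0;
    for (char32_t cp : text)
        n += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    return n;
}

// Bytes needed to encode the text as UTF-16: one or two 16-bit units.
inline std::size_t utf16Bytes(const std::u32string &text)
{
    std::size_t n = 0;
    for (char32_t cp : text)
        n += cp < 0x10000 ? 2 : 4;
    return n;
}

inline void demo()
{
    // ASCII: UTF-8 needs half the space.
    assert(utf8Bytes(U"hello") == 5 && utf16Bytes(U"hello") == 10);
    // Cyrillic (six letters, U+0400 block): both encodings need 12 bytes.
    const std::u32string cyrillic = U"\u043F\u0440\u0438\u0432\u0435\u0442";
    assert(utf8Bytes(cyrillic) == 12 && utf16Bytes(cyrillic) == 12);
    // CJK (three ideographs): 9 bytes in UTF-8, 6 in UTF-16, a 33% reduction.
    const std::u32string cjk = U"\u65E5\u672C\u8A9E";
    assert(utf8Bytes(cjk) == 9 && utf16Bytes(cjk) == 6);
}
```

Real documents mix scripts with ASCII markup, digits and whitespace, so actual ratios fall somewhere between these pure cases.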
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Thu, May 14, 2020 at 06:12:15PM -0700, Thiago Macieira wrote: That means the string classes, if any, need to be convertible to QByteArray anyway. yes, via QTextCodec. (behind the scenes some friend functions may be used for zero-copy conversions.) ___ Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote: > Also, given a function like > > setFoo(const QByteArray &); > > what does this actually expect? An UTF-8 string? A local 8-bit string? > An octet stream? A Latin-1 string? QByteArray is the jack of all these, > master of none. Like that, it's just "array of bytes of an arbitrary encoding (or none)". There's still a reason to have QByteArray and it'll need to exist in networking and file I/O code. That means the string classes, if any, need to be convertible to QByteArray anyway. > So, assuming the premiss that QByteArray should not be string-ish > anymore, what do we want to have as the result type of QString::toUtf8() > and QString::toLatin1()? Do we really want mere bytes? > > I don't think so. Since for Qt, String = UTF-16, then anything in another encoding is "a bag of bytes". QByteArray does serve that purpose. > If Unicode succeeds, most I/O will be in the form of UTF-8. File names > on Unix are UTF-8 (for all intents and purposes these days), not UTF-16 > (as they are on Windows). It makes a _ton_ of sense to have a container > for this, and C++20 tempts us with char8_t to do exactly that. I'd love > to do string processing in UTF-8 without potentially doubling the > storage requirements by first converting it to UTF-16, then doing the > processing, then converting it back. Unless you're processing Cyrillic or Greek text, in which case your memory usage will be about the same. Or if you're processing CJK, in which case UTF-16 is a 33% reduction in memory use. > Qt should have a strong story not just for UTF-16, but also for UTF-8. So long as it's not confusing on which class to use, sure. If that means a proliferation of overloads everywhere, we've gone wrong somewhere. > I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one, > provided a) we can depend on char8_t (ie. 
Qt 7) and b) utf-8 <-> utf16 > operations are not much slower than L1 <-> utf16 ones (I heard Lars' > team has them down to within 5% of each other, not sure that's > possible). The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The difference in performance is the need to check if the high bit is set. Both codecs are vectorised with both SSE2 and AVX2 implementations. There are also Neon implementations, but I don't know their benchmark numbers (note: the UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit). For non-US-ASCII Latin1 text, the performance is more than 5% worse, depending on how dense the non-ASCII characters are in the string. But given that we want our files to be encoded in UTF-8 anyway, decoding of non-ASCII Latin1 should be rare. I also have an implementation of UTF-16 to ASCII codec, which is the same as UTF-16 to Latin1, but without error checking. That requires that the string class store whether it contains only US-ASCII. I've never pushed this to Qt. > Anyway, we'd have two class templates, and they'd just be > instantiated with different Char types to flesh out all of the above, > with the exception of the byte array ones: > >using QUtf8String = QBasicString; >using QString = QBasicString; >using QLatin1String = QBasicString; >(using QByteArray = QVector;) BTW, I've said this before: QVector should over-allocate by one element and memset it to zero, if the element is small enough (4 or 8 bytes). This should be done behind the scenes, so the API would never notice it. But it would allow transferring the ownership of a QByteArray's payload to any of the other classes and still have a null-terminated string. I don't mind having a QUtf8String{,View} but there needs to be a limit into how much we add to its API. Do we have indexOf(char32_t) optimised with vectorisation? Do we have indexOf(QRegularExpression)? 
The latter would make us link to libpcre2-8 in addition to libpcre2-16 or would require on-the-fly conversions and memory allocations. If your objective is to speed things up, having too many methods may actually make it worse. And then there's the overload set for generic functions. I'm going to insist a single, clear rule that does not depend on implementation details and is reasonably future-proof. It has to be about *what* the function does, not *how* it does that. > If, after getting all of the above runnig, we _then_ want The One String > (View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we > need a QAnyString), which can contain any of the 2-4 string (view) > classes above (but not QByteArray(View)), but which doesn't have > string-ish API. Instead, you need to inspect it to extract the actual > string class (QLatin1String, QUtf8String, QString) contained, or simply > ask for the one you want, and it will convert, if necessary. Excluding QLatin1String since I don't think we need that, I'm willing to see this
[Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
Hi Lars,

On 2020-05-12 09:49, Lars Knoll wrote:
[...]
> One open question is whether we should add a QUtf8String with a char8_t. I am not yet convinced that we actually need the class though.
[...]

I positively want to stop using QByteArray as the QUtf8String that it currently is. QByteArray should lose all notion of string-ness (deprecate toLower() etc, remove in Qt 7) and be a QVector<std::byte>. Not sure we'll get there for Qt 6, not sure we'll get there with the name QByteArray, but that should be the end game for this class.

The networking code is full of uses of QByteArray and, due to the lack of QByteArrayRef (QStringRef) or QByteArrayView (QStringView), its splitting and substringing is much less performant than it could be. Also, given a function like

    setFoo(const QByteArray &);

what does this actually expect? A UTF-8 string? A local 8-bit string? An octet stream? A Latin-1 string? QByteArray is the jack of all these, master of none.

So, assuming the premise that QByteArray should not be string-ish anymore, what do we want to have as the result type of QString::toUtf8() and QString::toLatin1()? Do we really want mere bytes?

I don't think so. If Unicode succeeds, most I/O will be in the form of UTF-8. File names on Unix are UTF-8 (for all intents and purposes these days), not UTF-16 (as they are on Windows). It makes a _ton_ of sense to have a container for this, and C++20 tempts us with char8_t to do exactly that. I'd love to do string processing in UTF-8 without potentially doubling the storage requirements by first converting it to UTF-16, then doing the processing, then converting it back.

Qt should have a strong story not just for UTF-16, but also for UTF-8.

I've talked about this at QtWS, but here's the TL;DV of it:

    value_type            container       view                string-ish API?
    char / QLatin1Char    QLatin1String   QLatin1StringView   yes
    char8_t / qchar8      QUtf8String     QUtf8StringView     yes
    char16_t / QChar      QString         QStringView         yes
    (char32_t             QUtf32String    QUtf32StringView    yes)
    std::byte             QByteArray      QByteArrayView      NO

I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one, provided a) we can depend on char8_t (i.e. Qt 7) and b) utf-8 <-> utf16 operations are not much slower than L1 <-> utf16 ones (I heard Lars' team has them down to within 5% of each other, not sure that's possible).

Anyway, we'd have two class templates, and they'd just be instantiated with different Char types to flesh out all of the above, with the exception of the byte array ones:

    using QUtf8String = QBasicString<char8_t>;
    using QString = QBasicString<char16_t>;
    using QLatin1String = QBasicString<char>;
    (using QByteArray = QVector<std::byte>;)

If, after getting all of the above running, we _then_ want The One String (View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we need a QAnyString), which can contain any of the 2-4 string (view) classes above (but not QByteArray(View)), but which doesn't have string-ish API. Instead, you need to inspect it to extract the actual string class (QLatin1String, QUtf8String, QString) contained, or simply ask for the one you want, and it will convert, if necessary.
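To make the "inspect it, or ask for the one you want" idea concrete, here is a minimal sketch in plain C++17 rather than Qt. Everything in it is an assumption made for illustration: AnyStringView, codeUnit() and demo() are hypothetical names, not the proposed QAnyStringView API, and only two alternatives (Latin-1 as char, UTF-16 as char16_t) are modelled so that no UTF-8 decoder is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <variant>

// Widen one code unit to UTF-16. Latin-1 maps 1:1 onto U+0000..U+00FF.
inline char16_t codeUnit(char c) { return static_cast<unsigned char>(c); }
inline char16_t codeUnit(char16_t c) { return c; }

// Hypothetical stand-in for the proposed view type: it holds exactly one of
// the concrete views and has no string-ish API of its own.
class AnyStringView {
public:
    AnyStringView(std::string_view latin1) : v_(latin1) {}
    AnyStringView(std::u16string_view utf16) : v_(utf16) {}

    // Inspect: call f with the contained view in its real type.
    template <typename F>
    decltype(auto) visit(F &&f) const { return std::visit(std::forward<F>(f), v_); }

    // Ask for the one you want: a centralized conversion to owned UTF-16.
    std::u16string toUtf16() const {
        return visit([](auto s) {
            std::u16string out;
            out.reserve(s.size());
            for (auto c : s)
                out.push_back(codeUnit(c));
            return out;
        });
    }

private:
    std::variant<std::string_view, std::u16string_view> v_;
};

// Mixed-mode comparison via double dispatch, without allocating.
inline bool operator==(AnyStringView lhs, AnyStringView rhs) {
    return lhs.visit([rhs](auto l) {
        return rhs.visit([l](auto r) {
            if (l.size() != r.size())
                return false;
            for (std::size_t i = 0; i < l.size(); ++i)
                if (codeUnit(l[i]) != codeUnit(r[i]))
                    return false;
            return true;
        });
    });
}

// Usage: the same text in two encodings compares equal and converts once.
inline void demo() {
    AnyStringView a("Hi");   // Latin-1 data
    AnyStringView b(u"Hi");  // UTF-16 data
    assert(a == b);
    assert(a.toUtf16() == u"Hi");
}
```

The code-unit-wise comparison is only valid here because both modelled encodings are fixed-width below U+0100; a UTF-8 alternative would need real decoding in codeUnit()'s place.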
With this, your typical Qt function taking strings would look like this:

    void QLineEdit::setText(QAnyStringView text)
    {
        Q_D(QLineEdit);
        if (text == d->text) // mixed-mode comparisons are supported out of the box
            return;
        d->text = text.toString(); // centralized conversion to QString (in library, not user code)
                                   // also available: toLatin1(), toUtf8()
        update();
    }

Callers now have total freedom in what to pass:

    le->setText("Hi");
    le->setText(u"Hi");
    le->setText(u8"Hi");
    le->setText(u"Hi"s);
    le->setText(u8"Hi"sv);
    le->setText(QVarLengthArray{'H', 'i'});
    le->setText("Hello" % ", World"); // QStringBuilder

and they'd all result in optimal code, because QAnyStringView is a trivial type (in the C++ sense), which means, unlike QString, it can be passed in CPU registers instead of on the stack.

Likewise, parsing code could do

    Meep parseMeep(QAnyStringView str)
    {
        return str.visit([](auto str) {
            Meep meep;
            for (auto me : str.tokenize(u'\n'))
                meep += parse(me);
            return meep;
        });
    }

IOW: instead of a bunch of overloads, you write your code as a template and let QAnyStringView instantiate your lambda with the actual type of string view passed.

As a further example, here's op== for QAnyStringView (provided by Qt):

    bool operator==(QAnyStringView lhs, QAnyStringView rhs) noexcept
    {
        return lhs.visit([rhs](auto lhs) {
            return rhs.visit([lhs](auto rhs) {
                return lhs == rhs;
            });
        });
    }

Last year, I heard someone (don't remember whom) suggest this for QString. That is: allow