Re: [Development] QUtf8String{, View}

2020-05-25 Thread Giuseppe D'Angelo via Development

On 25/05/20 17:40, Thiago Macieira wrote:
> On Monday, 25 May 2020 04:37:26 PDT Edward Welbourne wrote:
>> The "comparisons" heading might stretch as far as using a UTF-8 key to
>> do a look-up in a QString-keyed hash,
>
> Using UTF-8 data to look up in a QString-keyed hash will require conversion to
> UTF-16 to calculate the hash. It can't be calculated on-the-fly.


Being a bit creative, one could use an unordered_map, and 
with a custom transparent hasher that hashes the (first N) code points 
of the key... (Requires C++20)
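
A minimal sketch of the hashing half of that idea, in standard C++ rather than Qt types and assuming well-formed input (no validation is done). The names mixCodePoint, hashUtf8 and hashUtf16 are hypothetical, and the FNV-1a mixing is only an illustrative choice, not Qt's qHash. Because both functions hash decoded code points, a UTF-8 key and a UTF-16 key with the same text get the same hash without any intermediate conversion buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// FNV-1a over the four bytes of one code point (illustrative choice of hash).
inline std::uint64_t mixCodePoint(std::uint64_t h, char32_t cp)
{
    for (int shift = 0; shift < 32; shift += 8) {
        h ^= (static_cast<std::uint32_t>(cp) >> shift) & 0xFFu;
        h *= 1099511628211ull;
    }
    return h;
}

// Hash UTF-8 by decoding code points on the fly. Assumes valid UTF-8.
inline std::uint64_t hashUtf8(std::string_view s)
{
    std::uint64_t h = 14695981039346656037ull;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        const std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        h = mixCodePoint(h, cp);
        i += len;
    }
    return h;
}

// Hash UTF-16 the same way, combining surrogate pairs into one code point.
inline std::uint64_t hashUtf16(std::u16string_view s)
{
    std::uint64_t h = 14695981039346656037ull;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t cp = s[i];
        if (cp >= 0xD800 && cp < 0xDC00 && i + 1 < s.size()) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            ++i;
        }
        h = mixCodePoint(h, cp);
    }
    return h;
}
```

These two functions only need C++17; it is plugging them into a transparent hasher for heterogeneous unordered_map lookup that needs C++20, together with an equally transparent key-equality predicate.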


My 2 c,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts



___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-25 Thread Thiago Macieira
On Monday, 25 May 2020 04:37:26 PDT Edward Welbourne wrote:
> The "comparisons" heading might stretch as far as using a UTF-8 key to
> do a look-up in a QString-keyed hash,

Using UTF-8 data to look up in a QString-keyed hash will require conversion to 
UTF-16 to calculate the hash. It can't be calculated on-the-fly.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QUtf8String{, View}

2020-05-25 Thread Matthew Woehlke

On 25/05/2020 07.37, Edward Welbourne wrote:
> I would just call it QUtf8View, since (see below) I don't see value in a
> separate QUtf8String for it to be a view into


On the one hand...

std::string_view is not a view into a std::string. A std::string is a 
*container* for text, a std::string_view is a *view* for text. They both 
have 'string' in their name because they both deal with text, not 
because a std::string_view is a view of a std::string. Similarly, a 
QStringView may or may not be a view of a QString. Thus, it does not 
follow that having a QUtf8StringView in any way implies relation to, or 
existence of, a QUtf8String.


On the other hand, "Utf8String" is arguably redundant. But so is 
"Latin1String", which we already have.


I think, for the sake of existing precedent, QUtf8StringView is the 
correct name. If you are under the (mistaken) impression that an 
XStringView implies being a view of an XString, well, sorry, but that's 
just not the case, for any value of 'X' ('std::', 'Q', 'QUtf8', ...).


On a different note, if we *had* QUtf8String and something like 
QAnyString, it might help with a migration path by which we eventually 
rename QString to QUtf16String (likely with an alias initially) and 
eventually make QString an alias for QUtf8String instead.


--
Matthew


Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-25 Thread Edward Welbourne
Thiago Macieira (23 May 2020 03:06) wrote:
> Update:
>
> As we're reviewing the changes Lars is making to get rid of
> QStringRef, Lars, Marc and I came to the conclusion that
> QUtf8StringView is required for Qt 6.0.
[snip]

Sounds sensible.
I would just call it QUtf8View, since (see below) I don't see value in a
separate QUtf8String for it to be a view into, so making clear that it's
a view, not backed by any particular string type, has value; but the
detail of naming is less important.

> There are currently no conclusions on QUtf8String and QAnyString, nor
> on what the APIs should look like.

I don't really see the need for an owning 8-bit string type (hence,
equally, for QAnyString); we have QByteArray to serve as data-owner
behind a UTF-8 view, when the data's not a C-string literal but is known
to be UTF-8, and the simplicity of "when we store bytes with the
semantics of text, we always do so in UTF-16" argues against doing
anything more with UTF-8 views than supporting comparisons (including
starts-with, ends-with, contains, index-of) and constructing a QString
out of one.

The "comparisons" heading might stretch as far as using a UTF-8 key to
do a look-up in a QString-keyed hash, if doing so does actually bring a
meaningful saving compared to converting to UTF-16 first; which, of
course, might resurface in various other query APIs (asking for an HTTP
header's value from an object packaging a map, or an HTML tag's
attribute value).

There are perhaps other places where it'll make sense for APIs taking a
QStringView to also have a QUtf8View overload; but, crucially, by
limiting UTF-8 to view-level support, we provide a bound on how widely
it makes sense to burden our APIs with more overloads than just QString
and/or up to two views.

Eddy.


Re: [Development] QUtf8String{, View}

2020-05-23 Thread Thiago Macieira
On Saturday, 23 May 2020 05:39:37 PDT Giuseppe D'Angelo via Development wrote:
> To elaborate on this: does operator==(QStringView, char*) already exist
> (maybe under QT_NO_CAST...)? If yes, isn't that char* already assumed to
> be UTF-8? Do you want a QUtf8StringView to cleanly compile also under
> QT_NO_CAST_FROM_ASCII (and obviously use UTF-8, not Latin1), to reap
> compile-time strlen, etc?

All of the above.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QUtf8String{, View}

2020-05-23 Thread Giuseppe D'Angelo via Development

On 23/05/20 03:06, Thiago Macieira wrote:
> As we're reviewing the changes Lars is making to get rid of QStringRef, Lars,
> Marc and I came to the conclusion that QUtf8StringView is required for Qt 6.0.
> That's because some methods that previously returned QStringRef now return
> QStringView and to retain compatibility with:
>
>  if (xml.attribute("foo") == "bar")
>
> where QXmlStreamReader::attribute() returns QStringView, we really need to
> capture that "bar" as a UTF-8 string and we ought to have optimised UTF-16 to
> UTF-8 comparisons. So we're working on it.


To elaborate on this: does operator==(QStringView, char*) already exist 
(maybe under QT_NO_CAST...)? If yes, isn't that char* already assumed to 
be UTF-8? Do you want a QUtf8StringView to cleanly compile also under 
QT_NO_CAST_FROM_ASCII (and obviously use UTF-8, not Latin1), to reap 
compile-time strlen, etc?
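
To illustrate the "compile-time strlen" point: a minimal sketch, assuming a hypothetical Utf8View struct and utf8Literal() helper (neither is a Qt API). Taking the literal by array reference lets the compiler deduce its length, so nothing is measured at run time:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a UTF-8 view: pointer plus length, no ownership.
struct Utf8View {
    const char *data;
    std::size_t size;
};

// The array reference lets the compiler deduce N, so there is no strlen() call.
template <std::size_t N>
constexpr Utf8View utf8Literal(const char (&str)[N])
{
    return Utf8View{str, N - 1}; // N counts the terminating '\0'
}

static_assert(utf8Literal("bar").size == 3, "length is known at compile time");
```

A plain const char * parameter loses the array bound, which is why an overload taking char * has to call strlen() (or hope the optimiser folds it).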


Thanks,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts





Re: [Development] QUtf8String{, View}

2020-05-23 Thread Kai Pastor, DG0YT

On 23.05.20 at 03:06, Thiago Macieira wrote:
> On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
>> There's only our own laziness which stands in the way of this better
>> alternative.
>
> [snip the rest]
>
> Update:
>
> As we're reviewing the changes Lars is making to get rid of QStringRef, Lars,
> Marc and I came to the conclusion that QUtf8StringView is required for Qt 6.0.
> That's because some methods that previously returned QStringRef now return
> QStringView and to retain compatibility with:
>
>  if (xml.attribute("foo") == "bar")
>
> where QXmlStreamReader::attribute() returns QStringView, we really need to
> capture that "bar" as a UTF-8 string and we ought to have optimised UTF-16 to
> UTF-8 comparisons. So we're working on it.
>
> If it had been wrapped in QLatin1String(), there would be no compatibility
> issues, as there already is an operator==() for QStringView/QLatin1String.
>
> There are currently no conclusions on QUtf8String and QAnyString, nor on what
> the APIs should look like.

This also needs a solution in the other direction, QXmlStreamWriter: 
This is painfully slow, despite UTF-8 XML, UTF-8 source code (element 
names, attribute names) and data which might be used as - or transformed 
to - UTF-8 directly. Everything needs to go to UTF-16 first (QString), 
and then back to UTF-8. Allocations, Reference counting, no SSO...





Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-22 Thread Thiago Macieira
On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
> There's only our own laziness which stands in the way of this better
> alternative.
[snip the rest]

Update:

As we're reviewing the changes Lars is making to get rid of QStringRef, Lars, 
Marc and I came to the conclusion that QUtf8StringView is required for Qt 6.0. 
That's because some methods that previously returned QStringRef now return 
QStringView and to retain compatibility with:

if (xml.attribute("foo") == "bar")

where QXmlStreamReader::attribute() returns QStringView, we really need to 
capture that "bar" as a UTF-8 string and we ought to have optimised UTF-16 to 
UTF-8 comparisons. So we're working on it.
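
As a rough illustration of what a UTF-16 to UTF-8 comparison without conversion involves, here is a sketch in standard C++. It assumes well-formed input on both sides and uses hypothetical helper names (decodeNextUtf8, decodeNextUtf16, equalsUtf8); it is not the implementation being worked on for Qt, which would presumably be vectorised:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Decode one code point from valid UTF-8 starting at s[i]; advances i.
inline char32_t decodeNextUtf8(std::string_view s, std::size_t &i)
{
    const unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80)
        return b; // ASCII: one byte, one code point
    const int extra = b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
    char32_t cp = b & (0x3F >> extra); // payload bits of the lead byte
    for (int k = 0; k < extra && i < s.size(); ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Decode one code point from valid UTF-16 starting at s[i]; advances i.
inline char32_t decodeNextUtf16(std::u16string_view s, std::size_t &i)
{
    char32_t cp = s[i++];
    if (cp >= 0xD800 && cp < 0xDC00 && i < s.size())
        cp = 0x10000 + ((cp - 0xD800) << 10) + (s[i++] - 0xDC00);
    return cp;
}

// Compare code point by code point; no allocation, no intermediate buffer.
inline bool equalsUtf8(std::u16string_view utf16, std::string_view utf8)
{
    std::size_t i = 0, j = 0;
    while (i < utf16.size() && j < utf8.size()) {
        if (decodeNextUtf16(utf16, i) != decodeNextUtf8(utf8, j))
            return false;
    }
    return i == utf16.size() && j == utf8.size();
}
```

With something of this shape behind operator==, the xml.attribute("foo") == "bar" comparison needs neither a temporary QString nor a Latin-1 assumption about the literal.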

If it had been wrapped in QLatin1String(), there would be no compatibility 
issues, as there already is an operator==() for QStringView/QLatin1String.

There are currently no conclusions on QUtf8String and QAnyString, nor on what 
the APIs should look like.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QUtf8String{, View}

2020-05-16 Thread Arnaud Clère
>
> QUtf8StringIterator can be easily added to extract Unicode codepoints from
> the UTF-8 string like QStringIterator exists for the same in UTF-16.
>

I discovered QStringIterator through the KDAB blog; it is nice, thanks!
A QUtf8StringIterator would boost Qt's UTF-8 support, for example by making
it easy to split or filter a QByteArray based on QChar category.
I agree, though, that it is not really useful for I/O or networking code,
and much more so on QString.


Re: [Development] QUtf8String{, View}

2020-05-16 Thread Giuseppe D'Angelo via Development

On 5/16/20 6:16 PM, Thiago Macieira wrote:
> That opens a philosophical question. In:
>
>  QString s = u"a a\u0301"; // U+0301 COMBINING ACUTE ACCENT
>  s.replace('a', 'b');
>
> Should we now have a b with accent? (b́)


It's not philosophical at all, it's a defining question: at which level 
does QString operate? It does not operate at the EGC level, it operates 
at the UTF-16 level. (Proof: s.size() above is 4). Hence, the replace() 
above is merely replacing 0x0061 with 0x0062 in the char16_t-like storage.
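
The same behaviour can be reproduced with std::u16string standing in for QString (an assumption made so the sketch compiles without Qt; replaceCodeUnit is a hypothetical helper). The string holds four UTF-16 code units, and the replacement touches the two 0x0061 units while leaving U+0301 where it is:

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Replace every occurrence of one UTF-16 code unit with another.
// This works on storage units; it knows nothing about grapheme clusters.
inline std::u16string replaceCodeUnit(std::u16string s, char16_t from, char16_t to)
{
    std::replace(s.begin(), s.end(), from, to);
    return s;
}
```

Applied to u"a a\u0301" with 'a' and 'b', the result is u"b b\u0301": the combining accent now attaches to the second b.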


My 2 c,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts





Re: [Development] QUtf8String{, View}

2020-05-16 Thread Matthew Woehlke
On 15/05/2020 14.31, Thiago Macieira wrote:
> Python's bstr still behaves string-like and has methods like 
> QByteArray::indexOf(const char *).  QVector has no such methods.
> 
> But since we do have QListSpecialMethods, we can inject them into
> QVector too.

Right; I was assuming that would happen for e.g. decoding. I think
'block of memory' type functions also make sense, for example,
startsWith. Byte arrays are unique in that *sequences* of elements often
have meaning, which is not often the case for other sorts of arrays.
It's quite reasonable to ask if a byte array starts with 0xDE, 0xAD,
0xBE, 0xEF (which is not text!) or to search for such sequences. Slicing
(left, mid, right) are also useful, though those are probably useful for
*any* array.
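
A small sketch of those two "block of memory" operations on a plain byte container, using std::vector<std::uint8_t> as a stand-in for a hypothetical byte-vector QByteArray. The names Bytes, kDeadBeef, startsWith and indexOf are illustrative only:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// The non-text marker from the discussion above.
inline const Bytes kDeadBeef{0xDE, 0xAD, 0xBE, 0xEF};

// True if data begins with the given byte sequence.
inline bool startsWith(const Bytes &data, const Bytes &prefix)
{
    return prefix.size() <= data.size()
        && std::equal(prefix.begin(), prefix.end(), data.begin());
}

// Offset of the first occurrence of needle in data, or -1 if it is absent.
inline std::ptrdiff_t indexOf(const Bytes &data, const Bytes &needle)
{
    if (needle.empty())
        return 0; // an empty sequence is found at the start
    const auto it = std::search(data.begin(), data.end(),
                                needle.begin(), needle.end());
    return it == data.end() ? -1 : it - data.begin();
}
```

Neither function interprets the bytes as characters, which is the distinction being drawn from toUpper-style methods.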

-- 
Matthew


Re: [Development] QUtf8String{, View}

2020-05-16 Thread Konstantin Ritt
I face a discussion like this every couple of months ;)

Yes, we should have a b with accent, because it is exactly what the
programmer asked QString for; it is not our fault if b with accent is not
what he meant to get after replace.
QString (like any other tool) must not be used blindly or with zero
knowledge about what it operates on and what the expected result is.

s.length() is not the number of characters in the string; s.at(i)
does not return the i'th character in the string; and surely
QUtf(8|32)String.replace('a', 'b') won't behave any "better" (I mean there
won't be any magic behind the scenes that would save some idiot from
writing some dumb code).

Regards,
Konstantin


On Sat, 16 May 2020 at 19:17, Thiago Macieira wrote:

> On Saturday, 16 May 2020 08:52:19 PDT Arnaud Clère wrote:
> > Regarding the relevance of a QUtf8String, I feel like it would not be so
> > useful unless it allows to view its content as QChar instead of char (or
> > char8_t) since handling multibyte characters is so error prone. At least a
> > QChar handles most unicode characters as single entities...
>
> QUtf8StringIterator can be easily added to extract Unicode codepoints from
> the UTF-8 string like QStringIterator exists for the same in UTF-16.
>
> Usually, though, this means you're doing something wrong. Grapheme clusters
> can span multiple codepoints. Unless you're doing text shaping, you probably
> don't need them. And if you don't need them, why do you care about codepoints
> in the first place?
>
> That opens a philosophical question. In:
>
> QString s = u"a a\u0301"; // U+0301 COMBINING ACUTE ACCENT
> s.replace('a', 'b');
>
> Should we now have a b with accent? (b́)
>
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
>   Software Architect - Intel System Software Products
>
>
>


Re: [Development] QUtf8String{, View}

2020-05-16 Thread Thiago Macieira
On Saturday, 16 May 2020 08:52:19 PDT Arnaud Clère wrote:
> Regarding the relevance of a QUtf8String, I feel like it would not be so
> useful unless it allows to view its content as QChar instead of char (or
> char8_t) since handling multibyte characters is so error prone. At least a
> QChar handles most unicode characters as single entities...

QUtf8StringIterator can be easily added to extract Unicode codepoints from the 
UTF-8 string like QStringIterator exists for the same in UTF-16.

Usually, though, this means you're doing something wrong. Grapheme clusters 
can span multiple codepoints. Unless you're doing text shaping, you probably 
don't need them. And if you don't need them, why do you care about codepoints 
in the first place?

That opens a philosophical question. In:

QString s = u"a a\u0301"; // U+0301 COMBINING ACUTE ACCENT
s.replace('a', 'b');

Should we now have a b with accent? (b́)

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QUtf8String{, View}

2020-05-16 Thread Giuseppe D'Angelo via Development

On 16/05/20 17:52, Arnaud Clère wrote:
> Regarding the relevance of a QUtf8String, I feel like it would not be so
> useful unless it allows to view its content as QChar instead of char (or
> char8_t) since handling multibyte characters is so error prone. At least
> a QChar handles most unicode characters as single entities...


=> QStringIterator.

Cheers,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts





Re: [Development] QUtf8String{, View}

2020-05-16 Thread Arnaud Clère
Hi all,

As a user, I am happy with simplifications regarding string handling and I
think the choice of utf16 for QString makes sense for most code except for
file/networking code.

I was once concerned about not being able to distinguish between QByteArray
containing utf8 text and QByteArray containing other data. But now, I am
using the following pattern to distinguish between both in the *very rare
places* where I need to have a different overload for each, or want to
provide a separate type with strong guarantees about a QByteArray content:

struct Utf8 { QByteArray utf8; };

This pattern obviously works for "tagging" any kind of data. See
https://www.fluentcpp.com/2018/04/06/strong-types-by-struct/ for a
presentation of this pattern.
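
To show how the tag drives overload selection, here is a sketch with std::string standing in for QByteArray (an assumption, so that it compiles without Qt); describe() is a hypothetical function used only to make the chosen overload visible:

```cpp
#include <cassert>
#include <string>

// Tagged wrapper: the bytes inside are promised to be UTF-8 text.
struct Utf8 { std::string utf8; };

// Overload for untagged bytes of unknown encoding.
inline std::string describe(const std::string &)
{
    return "raw bytes";
}

// Overload chosen only when the caller wrapped the data explicitly.
inline std::string describe(const Utf8 &)
{
    return "UTF-8 text";
}
```

Because Utf8 is an aggregate with no converting constructor, a bare byte string never silently becomes a Utf8; the caller has to write Utf8{...} to make the claim.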

Regarding the relevance of a QUtf8String, I feel like it would not be so
useful unless it allows to view its content as QChar instead of char (or
char8_t) since handling multibyte characters is so error prone. At least a
QChar handles most unicode characters as single entities...

Cheers,
Arnaud


Re: [Development] QUtf8String{, View}

2020-05-15 Thread Thiago Macieira
On Friday, 15 May 2020 10:52:49 PDT Matthew Woehlke wrote:
> > Like that, it's just "array of bytes of an arbitrary encoding (or none)".
> > There's still a reason to have QByteArray and it'll need to exist in
> > networking and file I/O code. That means the string classes, if any, need
> > to be convertible to QByteArray anyway.
> 
> I think we can learn from Python 3 here... QByteArray should go the way
> of QStringList, i.e. yes, it *should* be a QVector<byte>. Like
> QVector, it might (should) have additional methods, such as
> explicit conversion to/from QString (a la Python's encode/decode), but
> it should *not* have string-like manipulation (e.g. toUpper).

Those are all Qt 7 work. We can deprecate those methods in Qt 6, but not 
remove them in 6.0.

Python's bstr still behaves string-like and has methods like 
QByteArray::indexOf(const char *).  QVector has no such methods.

But since we do have QListSpecialMethods, we can inject them into QVector
too.

> >> So, assuming the premise that QByteArray should not be string-ish
> >> anymore, what do we want to have as the result type of QString::toUtf8()
> >> and QString::toLatin1()? Do we really want mere bytes?
> 
> Yes. Maybe. Again, this is how Python 3 works.
> 
> It might make sense to have a QUtf8String class, but that should be
> distinct from, and not implicitly constructible from, QByteArray a.k.a.
> QVector<byte>. (Implicit conversion *to* QByteArray might be okay.)
> 
> (BTW, is 'byte' QByte or std::byte? Can we possibly achieve the latter?)

There's no QByte and we shouldn't have that type.

std::byte is an enum around the actual byte type (unsigned char) and char is 
also a byte.
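
These properties can be checked directly in standard C++17. In the sketch below, toUnsigned and fromChar are hypothetical helpers; the static_asserts restate what the standard specifies for std::byte:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// std::byte is a scoped enumeration whose underlying type is unsigned char.
static_assert(std::is_enum<std::byte>::value, "std::byte is an enum");
static_assert(std::is_same<std::underlying_type<std::byte>::type,
                           unsigned char>::value,
              "underlying type is unsigned char");
static_assert(sizeof(std::byte) == sizeof(char), "same size as char");

// Being an enum, std::byte has no arithmetic; conversions are explicit.
inline unsigned toUnsigned(std::byte b)
{
    return std::to_integer<unsigned>(b);
}

inline std::byte fromChar(char c)
{
    return static_cast<std::byte>(static_cast<unsigned char>(c));
}
```

The explicit casts are the practical cost of std::byte: code that wants to treat the data as characters has to say so at each boundary.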

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QUtf8String{, View}

2020-05-15 Thread Matthew Woehlke

On 14/05/2020 21.12, Thiago Macieira wrote:
> On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
>> Also, given a function like
>>
>>  setFoo(const QByteArray &);
>>
>> what does this actually expect? An UTF-8 string? A local 8-bit string?
>> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
>> master of none.
>
> Like that, it's just "array of bytes of an arbitrary encoding (or none)".
> There's still a reason to have QByteArray and it'll need to exist in
> networking and file I/O code. That means the string classes, if any, need to
> be convertible to QByteArray anyway.


I think we can learn from Python 3 here... QByteArray should go the way 
of QStringList, i.e. yes, it *should* be a QVector<byte>. Like
QVector, it might (should) have additional methods, such as 
explicit conversion to/from QString (a la Python's encode/decode), but 
it should *not* have string-like manipulation (e.g. toUpper).



>> So, assuming the premise that QByteArray should not be string-ish
>> anymore, what do we want to have as the result type of QString::toUtf8()
>> and QString::toLatin1()? Do we really want mere bytes?


Yes. Maybe. Again, this is how Python 3 works.

It might make sense to have a QUtf8String class, but that should be 
distinct from, and not implicitly constructible from, QByteArray a.k.a. 
QVector<byte>. (Implicit conversion *to* QByteArray might be okay.)


(BTW, is 'byte' QByte or std::byte? Can we possibly achieve the latter?)

--
Matthew


Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-15 Thread Thiago Macieira
On Friday, 15 May 2020 03:33:28 PDT Lars Knoll wrote:
> Pretty much all uses of QL1String that I’ve seen are about ASCII only
> content. That is certainly true for Qt itself, but also to a large degree
> for our users. For those, utf-8 conversions are within 5% of latin1
> decoding. This makes it very clear to me that we should *not* have any
> special handling for ascii that require a separate API.

We don't want Latin1 content in our files. There are two reasons for having 
QLatin1String and not QAsciiString:

1) historical. It was added in 4.0 (2005), when a good fraction of people were
still running 8-bit Latin1 or Latin9 as their locales. It was actually added
as a replacement for people writing macros like this in 3.x times:

#define L1S(x)  QString::fromLatin1(x)

Additionally, we mis-purposed the name "Ascii" in Qt to mean "locale-encoded
strings".

2) the Latin1 codec is FAST, but only because it needs to do no error 
checking. If we had a QAsciiString class or proper US-ASCII conversion 
functions, we'd get bug reports that something with a high bit set was not 
flagged and replaced with U+FFFD Replacement Character when converted. This 
error checking is similar to the UTF-8 decoding, which would make it as fast 
as the UTF-8 decoder in terms of performance for US-ASCII content.
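
The difference between the two decoders can be sketched in scalar standard C++ (the real codecs are vectorised; decodeLatin1 and decodeAscii are hypothetical names, and std::u16string stands in for QString). Latin-1 needs no branch because every byte value is a valid code point, while a strict US-ASCII decoder has to test the high bit of every byte:

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Latin-1 maps byte value N to code point U+00NN, so every input byte is valid.
inline std::u16string decodeLatin1(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char c : in)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    return out;
}

// US-ASCII only defines 0x00..0x7F; anything else becomes U+FFFD.
inline std::u16string decodeAscii(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char c : in) {
        const unsigned char b = static_cast<unsigned char>(c);
        out.push_back(b < 0x80 ? static_cast<char16_t>(b) : char16_t(0xFFFD));
    }
    return out;
}
```

For pure US-ASCII input the two produce identical output; the extra per-byte test in decodeAscii is the same kind of check a UTF-8 decoder performs on its fast path.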

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-15 Thread Lars Knoll
> On 15 May 2020, at 03:12, Thiago Macieira wrote:
> 
> On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
>> Also, given a function like
>> 
>>setFoo(const QByteArray &);
>> 
>> what does this actually expect? An UTF-8 string? A local 8-bit string?
>> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
>> master of none.

What I would like to do right now for 6.0 is that all 8-bit encoded text is
assumed to be UTF-8. Simple as that. If it’s something else, the developer will
have to take care of it himself. This is an important point for Qt 6.0 and
independent of any QUtf8String we might or might not add later on.
> 
> Like that, it's just "array of bytes of an arbitrary encoding (or none)". 
> There's still a reason to have QByteArray and it'll need to exist in 
> networking and file I/O code. That means the string classes, if any, need to 
> be convertible to QByteArray anyway.

Agreed.
> 
>> So, assuming the premiss that QByteArray should not be string-ish
>> anymore, what do we want to have as the result type of QString::toUtf8()
>> and QString::toLatin1()? Do we really want mere bytes?
>> 
>> I don't think so.
> 
> Since for Qt, String = UTF-16, then anything in another encoding is "a bag of 
> bytes". QByteArray does serve that purpose.
> 
>> If Unicode succeeds, most I/O will be in the form of UTF-8. File names
>> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
>> (as they are on Windows). It makes a _ton_ of sense to have a container
>> for this, and C++20 tempts us with char8_t to do exactly that. I'd love
>> to do string processing in UTF-8 without potentially doubling the
>> storage requirements by first converting it to UTF-16, then doing the
>> processing, then converting it back.

What are we actually gaining by having another string class? Yes, UTF-8 is 
being used in many places. But are the gains of directly working on UTF-8 
enough to justify the duplication of all our string related APIs and 
implementations?
> 
> Unless you're processing Cyrillic or Greek text, in which case your memory 
> usage will be about the same. Or if you're processing CJK, in which case 
> UTF-16 is a 33% reduction in memory use.

Correct. Utf-8 only saves space for content that is mostly ascii. But if you 
only need ascii text processing, you can just as well do it on the current 
QByteArray.
> 
>> Qt should have a strong story not just for UTF-16, but also for UTF-8.
> 
> So long as it's not confusing on which class to use, sure. If that means a 
> proliferation of overloads everywhere, we've gone wrong somewhere.

+1. 

Almost all other programming languages out there have standardised on one class 
for unicode string/text handling. IMO this is the correct approach. The fact 
that we’re using UTF-16 is historical, but it’s not better or worse than UTF-8. 
Let’s make transcoding fast, and stop worrying about several encodings.
> 
>> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one,

I’ll veto any UTF-32 string class. There is simply not a single good reason for 
using such a class. The only ‘advantage’ it has is one unicode code point per 
index, but that doesn’t help as unicode text processing anyways needs to look 
beyond that (at e.g. grapheme clusters etc). And it wastes lots of memory.

>> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16
>> operations are not much slower than L1 <-> utf16 ones (I heard Lars'
>> team has them down to within 5% of each other, not sure that's
>> possible). 
> 
> The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is 
> within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The 
> difference in performance is the need to check if the high bit is set. Both 
> codecs are vectorised with both SSE2 and AVX2 implementations. There are also 
> Neon implementations, but I don't know their benchmark numbers (note: the 
> UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit).
> 
> For non-US-ASCII Latin1 text, the performance is more than 5% worse, depending
> on how dense the non-ASCII characters are in the string. But given that we
> want our files to be encoded in UTF-8 anyway, decoding of non-ASCII Latin1
> should be rare.
> 
> I also have an implementation of UTF-16 to ASCII codec, which is the same as 
> UTF-16 to Latin1, but without error checking. That requires that the string 
> class store whether it contains only US-ASCII. I've never pushed this to Qt.

Pretty much all uses of QL1String that I’ve seen are about ASCII only content. 
That is certainly true for Qt itself, but also to a large degree for our users. 
For those, utf-8 conversions are within 5% of latin1 decoding. This makes it 
very clear to me that we should *not* have any special handling for ascii that 
require a separate API.

Conversion speed for non ascii content is something we can improve, there are 
various BSD licensed 

Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-15 Thread Oswald Buddenhagen

On Thu, May 14, 2020 at 06:12:15PM -0700, Thiago Macieira wrote:
> That means the string classes, if any, need to be convertible to
> QByteArray anyway.



yes, via QTextCodec.
(behind the scenes some friend functions may be used for zero-copy 
conversions.)



Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-14 Thread Thiago Macieira
On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
> Also, given a function like
> 
> setFoo(const QByteArray &);
> 
> what does this actually expect? An UTF-8 string? A local 8-bit string?
> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
> master of none.

Like that, it's just "array of bytes of an arbitrary encoding (or none)". 
There's still a reason to have QByteArray and it'll need to exist in 
networking and file I/O code. That means the string classes, if any, need to 
be convertible to QByteArray anyway.

> So, assuming the premiss that QByteArray should not be string-ish
> anymore, what do we want to have as the result type of QString::toUtf8()
> and QString::toLatin1()? Do we really want mere bytes?
> 
> I don't think so.

Since for Qt, String = UTF-16, then anything in another encoding is "a bag of 
bytes". QByteArray does serve that purpose.

> If Unicode succeeds, most I/O will be in the form of UTF-8. File names
> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
> (as they are on Windows). It makes a _ton_ of sense to have a container
> for this, and C++20 tempts us with char8_t to do exactly that. I'd love
> to do string processing in UTF-8 without potentially doubling the
> storage requirements by first converting it to UTF-16, then doing the
> processing, then converting it back.

Unless you're processing Cyrillic or Greek text, in which case your memory 
usage will be about the same. Or if you're processing CJK, in which case 
UTF-16 is a 33% reduction in memory use.
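
The arithmetic behind those figures can be checked with a short sketch that counts encoded bytes per code point (utf8Bytes and utf16Bytes are hypothetical helper names; the input is assumed to contain only valid code points):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Bytes needed to store the text as UTF-8.
inline std::size_t utf8Bytes(std::u32string_view text)
{
    std::size_t n = 0;
    for (char32_t cp : text)
        n += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    return n;
}

// Bytes needed to store the text as UTF-16 (surrogate pairs take 4 bytes).
inline std::size_t utf16Bytes(std::u32string_view text)
{
    std::size_t n = 0;
    for (char32_t cp : text)
        n += cp < 0x10000 ? 2 : 4;
    return n;
}
```

Cyrillic and Greek letters sit below U+0800, so they take two bytes in either encoding. Most CJK ideographs sit between U+0800 and U+FFFF, so they take three bytes in UTF-8 against two in UTF-16, which is where the one-third saving comes from. ASCII is the case where UTF-8 halves the size.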

> Qt should have a strong story not just for UTF-16, but also for UTF-8.

So long as it's not confusing on which class to use, sure. If that means a 
proliferation of overloads everywhere, we've gone wrong somewhere.

> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one,
> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16
> operations are not much slower than L1 <-> utf16 ones (I heard Lars'
> team has them down to within 5% of each other, not sure that's
> possible). 

The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is 
within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The 
difference in performance is the need to check if the high bit is set. Both 
codecs are vectorised with both SSE2 and AVX2 implementations. There are also 
Neon implementations, but I don't know their benchmark numbers (note: the 
UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit).

For non-US-ASCII Latin1 text, the performance is more than 5% worse, depending 
on how dense the non-ASCII characters are in the string. But given that we 
want our files to be encoded in UTF-8 anyway, decoding of non-ASCII Latin1 
should be rare.

I also have an implementation of UTF-16 to ASCII codec, which is the same as 
UTF-16 to Latin1, but without error checking. That requires that the string 
class store whether it contains only US-ASCII. I've never pushed this to Qt.

> Anyway, we'd have two class templates, and they'd just be
> instantiated with different Char types to flesh out all of the above,
> with the exception of the byte array ones:
> 
>using QUtf8String = QBasicString<char8_t>;
>using QString = QBasicString<char16_t>;
>using QLatin1String = QBasicString<char>;
>(using QByteArray = QVector<std::byte>;)

BTW, I've said this before: QVector should over-allocate by one element and 
memset it to zero, if the element is small enough (4 or 8 bytes). This should 
be done behind the scenes, so the API would never notice it. But it would 
allow transferring the ownership of a QByteArray's payload to any of the other 
classes and still have a null-terminated string.
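
A minimal sketch of the over-allocation idea, in standard C++ rather than as a QVector patch. PaddedArray is a hypothetical name, and the size cut-off of 8 bytes simply mirrors the "4 or 8 bytes" above. The container reserves one extra element, zeroes it, and never reports it through size().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

// Hypothetical container, not a Qt class: it always allocates one hidden
// element past the end and zeroes it, so the payload is null-terminated
// whatever the element type is.
template <typename T>
class PaddedArray
{
    static_assert(std::is_trivially_copyable<T>::value && sizeof(T) <= 8,
                  "the hidden terminator is only meant for small, trivial elements");

public:
    PaddedArray(const T *src, std::size_t n)
        : m_data(new T[n + 1]), m_size(n)
    {
        if (n)
            std::memcpy(m_data.get(), src, n * sizeof(T));
        std::memset(m_data.get() + n, 0, sizeof(T)); // hidden zero element
    }

    std::size_t size() const { return m_size; } // terminator is not counted
    const T *data() const { return m_data.get(); }
    const T &operator[](std::size_t i) const { return m_data[i]; }

private:
    std::unique_ptr<T[]> m_data;
    std::size_t m_size;
};
```

Because the terminator is always there, a byte payload built this way could be handed to a string class and used as a C string without reallocating. The ownership transfer itself is not shown here.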

I don't mind having a QUtf8String{,View} but there needs to be a limit on 
how much we add to its API. Do we have indexOf(char32_t) optimised with 
vectorisation? Do we have indexOf(QRegularExpression)? The latter would make 
us link to libpcre2-8 in addition to libpcre2-16 or would require on-the-fly 
conversions and memory allocations. If your objective is to speed things up, 
having too many methods may actually make it worse.

And then there's the overload set for generic functions. I'm going to insist on a 
single, clear rule that does not depend on implementation details and is 
reasonably future-proof. It has to be about *what* the function does, not 
*how* it does that.

> If, after getting all of the above running, we _then_ want The One String
> (View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we
> need a QAnyString), which can contain any of the 2-4 string (view)
> classes above (but not QByteArray(View)), but which doesn't have
> string-ish API. Instead, you need to inspect it to extract the actual
> string class (QLatin1String, QUtf8String, QString) contained, or simply
> ask for the one you want, and it will convert, if necessary.

Excluding QLatin1String since I don't think we need that, I'm willing to see 
this 

[Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-14 Thread Marc Mutz via Development

Hi Lars,

On 2020-05-12 09:49, Lars Knoll wrote:
[...]

> One open question is whether we should add a QUtf8String with a
> char8_t. I am not yet convinced that we actually need the class
> though.

[...]

I positively want to stop using QByteArray as the QUtf8String that it 
currently is. QByteArray should lose all notion of string-ness 
(deprecate toLower() etc, remove in Qt 7) and be a QVector<std::byte>. 
Not sure we'll get there for Qt 6, not sure we'll get there with the 
name QByteArray, but that should be the end game for this class.


The networking code is full of uses of QByteArray and due to the lack of 
QByteArrayRef (QStringRef) or QByteArrayView (QStringView), its 
splitting and substringing is much less performant than it could be.


Also, given a function like

   setFoo(const QByteArray &);

what does this actually expect? A UTF-8 string? A local 8-bit string? 
An octet stream? A Latin-1 string? QByteArray is the jack of all these, 
master of none.


So, assuming the premiss that QByteArray should not be string-ish 
anymore, what do we want to have as the result type of QString::toUtf8() 
and QString::toLatin1()? Do we really want mere bytes?


I don't think so.

If Unicode succeeds, most I/O will be in the form of UTF-8. File names 
on Unix are UTF-8 (for all intents and purposes these days), not UTF-16 
(as they are on Windows). It makes a _ton_ of sense to have a container 
for this, and C++20 tempts us with char8_t to do exactly that. I'd love 
to do string processing in UTF-8 without potentially doubling the 
storage requirements by first converting it to UTF-16, then doing the 
processing, then converting it back.


Qt should have a strong story not just for UTF-16, but also for UTF-8.

I've talked about this at QtWS, but here's the TL;DV of it:

value_type           container      view               string-ish API?

char / QLatin1Char   QLatin1String  QLatin1StringView  yes
char8_t / qchar8     QUtf8String    QUtf8StringView    yes
char16_t / QChar     QString        QStringView        yes
(char32_t            QUtf32String   QUtf32StringView   yes)

std::byte            QByteArray     QByteArrayView     NO

I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one, 
provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16 
operations are not much slower than L1 <-> utf16 ones (I heard Lars' 
team has them down to within 5% of each other, not sure that's 
possible). Anyway, we'd have two class templates, and they'd just be 
instantiated with different Char types to flesh out all of the above, 
with the exception of the byte array ones:


  using QUtf8String = QBasicString<char8_t>;
  using QString = QBasicString<char16_t>;
  using QLatin1String = QBasicString<char>;
  (using QByteArray = QVector<std::byte>;)

If, after getting all of the above running, we _then_ want The One String 
(View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we 
need a QAnyString), which can contain any of the 2-4 string (view) 
classes above (but not QByteArray(View)), but which doesn't have 
string-ish API. Instead, you need to inspect it to extract the actual 
string class (QLatin1String, QUtf8String, QString) contained, or simply 
ask for the one you want, and it will convert, if necessary.


With this, your typical Qt function taking strings would look like this:

   void QLineEdit::setText(QAnyStringView text)
   {
       Q_D(QLineEdit);
       if (text == d->text) // mixed-mode comparisons are supported out of the box
           return;
       d->text = text.toString(); // centralized conversion to QString (in library, not user code)
                                  // also available: toLatin1(), toUtf8()
       update();
   }

Callers now have total freedom in what to pass:

   le->setText("Hi");
   le->setText(u"Hi");
   le->setText(u8"Hi");
   le->setText(u"Hi"s);
   le->setText(u8"Hi"sv);
   le->setText(QVarLengthArray{'H', 'i'});
   le->setText("Hello" % ", World"); // QStringBuilder

and they'd all result in optimal code, because QAnyStringView is a 
trivial type (in the C++ sense), which means, unlike QString, it can be 
passed in CPU registers instead of on the stack.


Likewise, parsing code could do

   Meep parseMeep(QAnyStringView str)
   {
       return str.visit([](auto str) {
           Meep meep;
           for (auto me : str.tokenize(u'\n'))
               meep += parse(me);
           return meep;
       });
   }

iow: instead of a bunch of overloads, you write your code as a template 
and let QAnyStringView instantiate your lambda with the actual type of 
string view passed.
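
For readers without the proposed classes at hand, the same dispatch can be approximated in standard C++17 with std::variant and std::visit. AnyStringView below is a hypothetical stand-in, not the proposed QAnyStringView, and it holds only two alternatives (8-bit and UTF-16 views). countLines is an invented example function.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <variant>

// Hypothetical stand-in for the proposed QAnyStringView: a cheap-to-copy
// value holding one of several view types.
class AnyStringView
{
public:
    AnyStringView(std::string_view s) : m_view(s) {}
    AnyStringView(std::u16string_view s) : m_view(s) {}
    AnyStringView(const char *s) : m_view(std::string_view(s)) {}
    AnyStringView(const char16_t *s) : m_view(std::u16string_view(s)) {}

    // Calls the visitor with whichever concrete view is stored.
    template <typename Visitor>
    decltype(auto) visit(Visitor &&v) const
    {
        return std::visit(std::forward<Visitor>(v), m_view);
    }

private:
    std::variant<std::string_view, std::u16string_view> m_view;
};

// Written once as a generic lambda; the compiler instantiates it for
// each view type, so no overload set is needed.
std::size_t countLines(AnyStringView str)
{
    return str.visit([](auto s) {
        using Char = typename decltype(s)::value_type;
        std::size_t lines = 0;
        bool inToken = false;
        for (Char c : s) {
            if (c == Char('\n')) {
                inToken = false;
            } else if (!inToken) {
                inToken = true; // first character of a non-empty line
                ++lines;
            }
        }
        return lines;
    });
}
```

Callers can pass either a narrow or a UTF-16 literal to countLines, and the loop body is compiled separately for each character type.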


As a further example, here's op== for QAnyStringView (provided by Qt):

   bool operator==(QAnyStringView lhs, QAnyStringView rhs) noexcept
   {
       return lhs.visit([rhs](auto lhs) {
           return rhs.visit([lhs](auto rhs) {
               return lhs == rhs;
           });
       });
   }
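
The nested visit can be tried out in standard C++17, where std::visit accepts two variants and performs the double dispatch in one call. The sketch is restricted to Latin-1 and UTF-16 views, because Latin-1 code units map one-to-one onto U+0000..U+00FF and can be compared unit by unit. A UTF-8 alternative would need decoding first. AnyView, widen and equals are illustrative names, not Qt API.

```cpp
#include <algorithm>
#include <cassert>
#include <string_view>
#include <variant>

// Hypothetical: either Latin-1 bytes or UTF-16 code units.
using AnyView = std::variant<std::string_view, std::u16string_view>;

// Widen one code unit to char16_t. Correct for Latin-1 only.
inline char16_t widen(char c)
{
    return static_cast<char16_t>(static_cast<unsigned char>(c));
}
inline char16_t widen(char16_t c) { return c; }

bool equals(const AnyView &lhs, const AnyView &rhs)
{
    // The generic lambda is instantiated for all four combinations of
    // (Latin-1, UTF-16) x (Latin-1, UTF-16).
    return std::visit([](auto l, auto r) {
        return std::equal(l.begin(), l.end(), r.begin(), r.end(),
                          [](auto a, auto b) { return widen(a) == widen(b); });
    }, lhs, rhs);
}
```

The cost is one instantiation per pair of alternatives, so the amount of generated code grows with the square of the number of string types supported.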

Last year, I heard someone (don't remember whom) suggest this for 
QString. That is: allow