Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?

2024-02-06 Thread Thiago Macieira
On Tuesday, 6 February 2024 22:02:11 PST Marc Mutz via Development wrote:
> As someone who argues that qHash("Hello"_L1) be restored to
> qHash(u"Hello"_s) after this relation was broken somewhere in Qt 5, how
> could I argue against it? :)

It was commit ea8e48a6799cf742ea23f4a30dcfc38a4988fe56 in 5.3.0.

commit ea8e48a6799cf742ea23f4a30dcfc38a4988fe56
Author: Thiago Macieira 
Date:   Thu Dec 19 23:32:04 2013 -0800

Update the qHash function for strings to use the CRC32 instruction

According to my profiling of Qt Creator, qHash and the SHA-1 calculation
are the hottest spots remaining in QtCore. The current qHash function is
not really vectorizable. We could come up with a different algorithm
that is more SIMD-friendly, but since we have the CRC32 instruction that
can read 32- and 64-bit entities, we're set.

This commit also updates the benchmark for QHash and benchmarks both
the hashing function itself and the QHash class. The updated
benchmarks for the CRC32 on my machine shows that the hashing function
is *always* improved, but the hashing isn't always. In particular, the
current algorithm is better for the "numbers" case, for which the data
sample differs in very few bits. The new code is 33% slower for that
particular case.

On average, the improvement (including the "numbers" case) is:

 compared to  qHash only  QHash
Qt 5.0 function  2.54x1.06x
Qt 4.x function  4.34x1.34x
Java function2.71x1.11x

Now (current = siphash & murmurhash; nonzero_curent = aeshash; we don't have 
the CRC32 algorithm any more):

RESULT : tst_QHash::hashing_qt4():"numbers":
 126,800.35305 CPU cycles per iteration, 4.66 GHz
 470,120.00894 instructions per iteration, 3.708 instr/cycle 
RESULT : tst_QHash::hashing_qt50():"numbers":
 83,198.10215 CPU cycles per iteration, 4.64 GHz 
 365,099.00604 instructions per iteration, 4.388 instr/cycle 
RESULT : tst_QHash::hashing_current():"numbers":
 151,220.07727 CPU cycles per iteration, 4.66 GHz 
 615,149.01052 instructions per iteration, 4.068 instr/cycle 
RESULT : tst_QHash::hashing_nonzero_current():"numbers":
 65,224.59934 CPU cycles per iteration, 4.45 GHz 
 235,073.00508 instructions per iteration, 3.604 instr/cycle 

So even the pathological "numbers" case is now faster by about 27%.

With "typical" data ("paths-small"):
qt4:23,848.939476 CPU cycles per iteration
qt50:   11,681.268231 CPU cycles per iteration
current:10,623.703491 CPU cycles per iteration
aeshash:2,870.472204 CPU cycles per iteration

The siphash/murmurhash implementation is slightly faster than the Qt 5.0 
implementation and both are over twice as fast as the 4.x implementation. The 
aeshash is on average 3.7x faster than even those two.

And with very long data ("longstrings-list"):
qt4:397,145.21743 CPU cycles per iteration
qt50:   225,105.45316 CPU cycles per iteration
siphash:102,900.78968 CPU cycles per iteration
aeshash:10,249.00974 CPU cycles per iteration

That's 10x faster on my laptop, which has 256-bit AES. With just 128-bit AES, 
my Skylake workstation is doing about 3.5x better than siphash for this case. 
See more numbers in 
https://codereview.qt-project.org/c/qt/qtbase/+/537009
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering


smime.p7s
Description: S/MIME cryptographic signature
-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?

2024-02-06 Thread Marc Mutz via Development
On 05.02.24 23:42, Thiago Macieira wrote:
> On Monday, 5 February 2024 01:36:39 PST Marc Mutz via Development wrote:
>> I've always understood it such that we as Qt must preserve the property
>> that the hash for equal elements is equal within _one_ run of _one_
>> process. This means you can use the hash in I/O in any way. That's why
>> we have qt_hash, which you _can_ (and do) use in I/O (but is private
>> API, AFAIK). I never thought that a hash seed of zero would change that,
>> but of course users may have come to depend on this (Hyrum's Law), so a
>> [ChangeLog] would be in order.
> 
> In commit e3f05981cbeb0b7721f960ef88effa77be2af5ce, I added this comment to
> qHashBits:
>  // mix in the length as a secondary seed. For seed == 0, seed2 must be
>  // size, to match what we used to do prior to Qt 6.2.
> 
> Which is why I am asking now, because making this change would go against that
> comment. But there was no discussion in the change about whether this was
> correct or not. It seems I just write it like that.
> 
> However, that was qHashBits(). The change I'm talking about is
> qHash(QLatin1StringView), specifically so it won't call qHashBits().

As someone who argues that qHash("Hello"_L1) be restored to 
qHash(u"Hello"_s) after this relation was broken somewhere in Qt 5, how 
could I argue against it? :)

-- 
Marc Mutz 
Principal Software Engineer

The Qt Company
Erich-Thilo-Str. 10 12489
Berlin, Germany
www.qt.io

Geschäftsführer: Mika Pälsi, Juha Varelius, Jouni Lintunen
Sitz der Gesellschaft: Berlin,
Registergericht: Amtsgericht Charlottenburg,
HRB 144331 B

-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?

2024-02-06 Thread Marc Mutz via Development
On 05.02.24 19:07, apoenitz wrote:
> On Mon, Feb 05, 2024 at 09:36:39AM +, Marc Mutz via Development wrote:
>> Hi,
>>
>> I've always understood it such that we as Qt must preserve the property
>> that the hash for equal elements is equal within _one_ run of _one_
>> process.
> 
> This would make debugging and testing of applications using QHash
> less deterministic.
> 
> There needs to ba some way to guarantee identical behavior of multiple runs
> of the same binary. Setting the seed to 0 is currently such a way


You're right, my comment was overly simplistic. The non-determinism 
should be in the seed, and nowhere else. So developers should be able to 
rely on hash functions yielding the same value if the same seed is given 
_and the same Qt version is used_. That doesn't mean that applications 
or developers can rely on seed = 0 yielding the same hash value across 
Qt releases (even patch ones).

The end result is the same, though: we can
- definitely change exported qHash() functions
- definitely _not_ change inline qHash() functions
- possibly _not_ change any other qHash() function
   (there shouldn't be any in this category, but historically,
   there were some).

Hope that clears it up.

Thanks,
Marc

-- 
Marc Mutz 
Principal Software Engineer

The Qt Company
Erich-Thilo-Str. 10 12489
Berlin, Germany
www.qt.io

Geschäftsführer: Mika Pälsi, Juha Varelius, Jouni Lintunen
Sitz der Gesellschaft: Berlin,
Registergericht: Amtsgericht Charlottenburg,
HRB 144331 B

-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?

2024-02-05 Thread Thiago Macieira
On Monday, 5 February 2024 01:36:39 PST Marc Mutz via Development wrote:
> I've always understood it such that we as Qt must preserve the property
> that the hash for equal elements is equal within _one_ run of _one_
> process. This means you can use the hash in I/O in any way. That's why
> we have qt_hash, which you _can_ (and do) use in I/O (but is private
> API, AFAIK). I never thought that a hash seed of zero would change that,
> but of course users may have come to depend on this (Hyrum's Law), so a
> [ChangeLog] would be in order.

In commit e3f05981cbeb0b7721f960ef88effa77be2af5ce, I added this comment to 
qHashBits:
// mix in the length as a secondary seed. For seed == 0, seed2 must be
// size, to match what we used to do prior to Qt 6.2.

Which is why I am asking now, because making this change would go against that 
comment. But there was no discussion in the change about whether this was 
correct or not. It seems I just write it like that.

However, that was qHashBits(). The change I'm talking about is 
qHash(QLatin1StringView), specifically so it won't call qHashBits().
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering


smime.p7s
Description: S/MIME cryptographic signature
-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?

2024-02-05 Thread Marc Mutz via Development
Hi,

I've always understood it such that we as Qt must preserve the property 
that the hash for equal elements is equal within _one_ run of _one_ 
process. This means you can use the hash in I/O in any way. That's why 
we have qt_hash, which you _can_ (and do) use in I/O (but is private 
API, AFAIK). I never thought that a hash seed of zero would change that, 
but of course users may have come to depend on this (Hyrum's Law), so a 
[ChangeLog] would be in order.

 From that follows that you can change the hash algorithm, provided the 
hash function is out-of-line exported, but you can't if it's inline (or, 
theoretically, non-exported, because some user may have copied the 
implementation into his application or wrote a conflicting one; which is 
why I'd suggest to =delete non-Qt-provided qHash functions: to prevent 
users from writing their own).

HTH,
Marc

On 03.02.24 22:08, Thiago Macieira wrote:
> Requirement: the qHash function in question is and has always been out-of-
> line. We obviously can't change the hashing function of inline code, because
> it may have been inlined.
> 
> When the seed is non-zero, we make no guarantees on what the result will be.
> The result is allowed to change between Qt versions and between machines, even
> for the same seed. Strictly speaking, you're not supposed to pass replay a
> seed, because some hash functions have a hidden seed you can't get or set.
> 
> But what about a zero seed? It's what we call a "deterministic hashing". We
> have changed the algorithms on .0 releases (in both 5.0 and 6.0 for QString).
> 
> The reason I'm asking is that right now
> 
>   qHash(QLatin1StringView(str)) == qHash(QByteArrayView(str))
> 
> but it would be far more useful to have
> 
>   qHash(QLatin1StringView(str)) == qHash(QString(str))
> 
> I've made the change for the non-zero seeds, but one couldn't rely on the
> above unless it applied for every seed.f
> 
> 
-- 
Marc Mutz 
Principal Software Engineer

The Qt Company
Erich-Thilo-Str. 10 12489
Berlin, Germany
www.qt.io

Geschäftsführer: Mika Pälsi, Juha Varelius, Jouni Lintunen
Sitz der Gesellschaft: Berlin,
Registergericht: Amtsgericht Charlottenburg,
HRB 144331 B

-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?

2024-02-04 Thread Thiago Macieira
On Sunday, 4 February 2024 06:12:18 PST Giuseppe D'Angelo via Development 
wrote:
> Il 03/02/24 22:08, Thiago Macieira ha scritto:
> > But what about a zero seed? It's what we call a "deterministic hashing".
> > We
> > have changed the algorithms on .0 releases (in both 5.0 and 6.0 for
> > QString).
>
> I don't think it means "deterministic" in the sense that the output will
> never change across Qt versions. It just means deterministic across
> multiple runs of the very same application using the very same Qt version.

Indeed but how would it know that it is the same Qt version? It might be 
running with a system Qt that got upgraded in the background, so a new start 
would change it. For that matter, it's far more likely that it's an 
application and its helper at the same time, but even then they can't 
guarantee they're using the same Qt.

> When you disable QHash seeding, the choice of making the seed equal to 0
> is just "a choice"; we could make it 42 or 0xF0CACC1A, for what is
> worth. qHash can still change its output at any time across Qt versions
> and software should never ever rely on that.

I'm not sure.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering


smime.p7s
Description: S/MIME cryptographic signature
-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?

2024-02-04 Thread Giuseppe D'Angelo via Development

Il 03/02/24 22:08, Thiago Macieira ha scritto:

But what about a zero seed? It's what we call a "deterministic hashing". We
have changed the algorithms on .0 releases (in both 5.0 and 6.0 for QString).


I don't think it means "deterministic" in the sense that the output will 
never change across Qt versions. It just means deterministic across 
multiple runs of the very same application using the very same Qt version.


When you disable QHash seeding, the choice of making the seed equal to 0 
is just "a choice"; we could make it 42 or 0xF0CACC1A, for what is 
worth. qHash can still change its output at any time across Qt versions 
and software should never ever rely on that.


My 2 c,

--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - Trusted Software Excellence



smime.p7s
Description: Firma crittografica S/MIME
-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


[Development] What's our policy on changing the result of qHash(T, 0) between major releases?

2024-02-03 Thread Thiago Macieira
Requirement: the qHash function in question is and has always been out-of-
line. We obviously can't change the hashing function of inline code, because 
it may have been inlined.

When the seed is non-zero, we make no guarantees on what the result will be. 
The result is allowed to change between Qt versions and between machines, even 
for the same seed. Strictly speaking, you're not supposed to pass replay a 
seed, because some hash functions have a hidden seed you can't get or set.

But what about a zero seed? It's what we call a "deterministic hashing". We 
have changed the algorithms on .0 releases (in both 5.0 and 6.0 for QString).

The reason I'm asking is that right now

qHash(QLatin1StringView(str)) == qHash(QByteArrayView(str))

but it would be far more useful to have

qHash(QLatin1StringView(str)) == qHash(QString(str))

I've made the change for the non-zero seeds, but one couldn't rely on the 
above unless it applied for every seed.f

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering


smime.p7s
Description: S/MIME cryptographic signature
-- 
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development