Re: Patch to change murmurhash implementation slightly

Thomas Dullien Tue, 25 Apr 2023 10:58:22 -0700

Hey all,

ok, attached is a second patch that adds some unit tests; I am happy to add
more.


This brings me back to my original question: I'd like to run some pretty
thorough benchmarking on Lucene, both for this change and for possible
other future changes, largely focused on indexing performance. What are
good command lines to do so? What are good corpora?

Cheers,
Thomas

On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <[email protected]>
wrote:

> Hey,
>
> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish
> official test vectors, see the following URLs:
> https://github.com/aappleby/smhasher/issues/6
>
> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
> There is a link to a pastebin entry in the first issue, which leads to
> https://pastebin.com/kkggV9Vx
>
> Now, the test vectors in that pastebin do not match either the output of
> pre-change Lucene's murmur3, nor the output of the Python mmh3 package.
> That said, the pre-change Lucene and the mmh3 package agree, just not with
> the published list.
>
> There *are* test vectors in the source code for the mmh3 python package,
> which I could use, or cook up a set of bespoke ones, or both (I share the
> concern about 8-byte boundaries and signedness).
>
> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <[email protected]> wrote:
>
>> i dont think we need a ton of random strings. But if you want to
>> optimize for strings of length 8, at a minimum there should be very
>> simple tests ensuring correctness for some boundary conditions (e.g.
>> string of length exactly 8). i would also strongly recommend testing
>> non-ascii since java is a language with signed integer types so it may
>> be susceptible to bugs where the input bytes have the "sign bit" set.
>>
>> IMO this could be 2 simple unit tests.
>>
>> usually at least with these kinds of algorithms you can also find
>> published "test vectors" that intend to seek out the corner cases. if
>> these exist for murmurhash, we should fold them in too.
>>
>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>> <[email protected]> wrote:
>> >
>> > Hey,
>> >
>> > I offered to run a large number of random-string-hashes to ensure that
>> the output is the same pre- and post-change. I can add an arbitrary number
>> of such tests to TestStringHelper.java, just specify the number you wish.
>> >
>> > If your worry is that my change breaches the inlining bytecode limit:
>> Did you check whether the old version was inlineable or not? The new
>> version is 263 bytecode instructions, the old version was 110. The default
>> inlining limit appears to be 35 bytecode instructions on cursory checking
>> (I may be wrong on this, though), so I don't think it was ever inlineable
>> in default configs.
>> >
>> > On your statement "we haven't seen performance gains" -- the starting
>> point of this thread was a friendly request to please point me to
>> instructions for running a broad range of Lucene indexing benchmarks, so I
>> can gather data for further discussion; from my perspective, we haven't
>> even gathered any data, so obviously we haven't seen any gains.
>> >
>> > Cheers,
>> > Thomas
>> >
>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <[email protected]> wrote:
>> >>
>> >> There is literally one string, all-ascii. This won't fail if all the
>> >> shifts and masks are wrong.
>> >>
>> >> About the inlining, i'm not talking about cpu stuff, i'm talking about
>> >> java. There are limits to the size of methods that get inlined (e.g.
>> >> -XX:MaxInlineSize). If we make this method enormous like this, it may
>> >> have performance consequences.
>> >>
>> >> We still haven't seen any performance gain from this. Elasticsearch
>> >> putting huge unique IDs into indexed terms doesnt count.
>> >>
>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
>> >> <[email protected]> wrote:
>> >> >
>> >> > Hey,
>> >> >
>> >> > so there are unit tests in TestStringHelper.java that test strings
>> of length greater than 8, and my change passes them. Could you explain what
>> you want tested?
>> >> >
>> >> > Cheers,
>> >> > Thomas
>> >> >
>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <[email protected]>
>> wrote:
>> >> >>
>> >> >> sure, but "if length > 8 return 1" might pass these same tests too,
>> >> >> yet cause a ton of hash collisions.
>> >> >>
>> >> >> I just think if you want to optimize for super-long strings, there
>> >> >> should be a unit test.
>> >> >>
>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
>> >> >> <[email protected]> wrote:
>> >> >> >
>> >> >> > Hey,
>> >> >> >
>> >> >> > I am pretty confident about correctness. The change passes both
>> Lucene and ES regression tests and my careful reading of the code is pretty
>> certain that the output is the same. If you want me to randomly test the
>> result for a few hundred million random strings, I'm happy to do that, too,
>> if you have other suggestions for correctness testing, let me know.
>> >> >> >
>> >> >> > The change does increase the method size and may impact inlining
>> - but so does literally any code change, particularly in a JIT'ed
>> environment where placement of code (and hence things like instruction
>> cache conflicts) depend on the precise history of execution. The way I
>> understand it, one deals with this by benchmarking and measuring.
>> >> >> >
>> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable
>> improvement in indexing speed - this is why I was asking about a broad
>> range of Lucene benchmarks; to verify that this is indeed the case for
>> Lucene-only, too.
>> >> >> >
>> >> >> > Let me know what data you'd like to see to decide whether this
>> patch is a good idea, and if there is consensus among the Lucene committers
>> that those are reasonable criteria, I'll work on producing that data.
>> >> >> >
>> >> >> > Cheers,
>> >> >> > Thomas
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <[email protected]>
>> wrote:
>> >> >> >>
>> >> >> >> well there is some cost, as it must add additional checks to see
>> if
>> >> >> >> its longer than 8. in your patch, additional loops. it increases
>> the
>> >> >> >> method size and may impact inlining and other things. also we
>> can't
>> >> >> >> forget about correctness, if the hash function does the wrong
>> thing it
>> >> >> >> could slow everything to a crawl.
>> >> >> >>
>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
>> >> >> >> <[email protected]> wrote:
>> >> >> >> >
>> >> >> >> > Ah, I see what you mean.
>> >> >> >> >
>> >> >> >> > You are correct -- the change will not speed up a 5-byte word,
>> but it *will* speed up all 8+-byte words, at no cost to the shorter words.
>> >> >> >> >
>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <[email protected]>
>> wrote:
>> >> >> >> >>
>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't
>> going to
>> >> >> >> >> speed anything up. there aren't 8 bytes to process.
>> >> >> >> >>
>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>> >> >> >> >> <[email protected]> wrote:
>> >> >> >> >> >
>> >> >> >> >> > Is average word length <= 4 realistic though? I mean, even
>> the english wiki corpus has ~5, which would require two calls to the lucene
>> layer instead of one; e.g. multiple layers of virtual dispatch that are
>> unnecessary?
>> >> >> >> >> >
>> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes
>> instead of 4 bytes, so the cost of doing so will be the same - while
>> speeding up in cases where 4 isn't quite enough?
>> >> >> >> >> >
>> >> >> >> >> > Cheers,
>> >> >> >> >> > Thomas
>> >> >> >> >> >
>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <
>> [email protected]> wrote:
>> >> >> >> >> >>
>> >> >> >> >> >> i think from my perspective it has nothing to do with cpus
>> being
>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length of
>> terms in
>> >> >> >> >> >> most languages being smaller than 8. for the languages
>> with longer
>> >> >> >> >> >> word length, its usually because of complex morphology
>> that most users
>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal
>> IMO.
>> >> >> >> >> >> languages from nature don't care about your cpu.
>> >> >> >> >> >>
>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>> >> >> >> >> >> <[email protected]> wrote:
>> >> >> >> >> >> >
>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a single
>> thread for indexing, and SerialMergeScheduler (using that single thread to
>> also do single-threaded merging).  It makes the indexing take forever lol
>> but it produces "comparable" results.
>> >> >> >> >> >> >
>> >> >> >> >> >> > But ... this sounds like a great change anyway?  Do we
>> really need to gate it on benchmark results?  Do we think there could be a
>> downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
>> >> >> >> >> >> >
>> >> >> >> >> >> > Mike McCandless
>> >> >> >> >> >> >
>> >> >> >> >> >> > http://blog.mikemccandless.com
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <
>> [email protected]> wrote:
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I think the results of the benchmark will depend on the
>> properties of
>> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil)
>> the average word
>> >> >> >> >> >> >> length is around 5 bytes so this optimization may not
>> do much.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <
>> [email protected]> wrote:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > I did a quick run with your patch, but since I turned
>> on the CMS as well as TieredMergePolicy I'm not sure how fair the
>> comparison is. Here's the result:
>> >> >> >> >> >> >> > Candidate:
>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620
>> docs
>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
>> >> >> >> >> >> >> > Indexer: finished (961877 msec)
>> >> >> >> >> >> >> > Baseline:
>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620
>> docs
>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
>> >> >> >> >> >> >> > Indexer: finished (964528 msec)
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > For more accurate comparison I guess it's better to
>> use LogxxMergePolicy and turn off CMS? If you want to run it yourself you
>> can find the lines I quoted from the log file.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > Patrick
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien <
>> [email protected]> wrote:
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Hey all,
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> I've been experimenting with fixing some low-hanging
>> performance fruit in the ElasticSearch codebase, and came across the fact
>> that the MurmurHash implementation that is used by ByteRef.hashCode() is
>> reading 4 bytes per loop iteration (which is likely an artifact from 32-bit
>> architectures, which are ever-less-important). I made a small fix to change
>> the implementation to read 8 bytes per loop iteration; I expected a very
>> small impact (2-3% CPU or so over an indexing run in ElasticSearch), but
>> got a pretty nontrivial throughput improvement over a few indexing
>> benchmarks.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and
>> succeeded in running the example from
>> https://github.com/mikemccand/luceneutil - but I couldn't figure out how
>> to run indexing benchmarks and how to interpret the results.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Could someone help me in running the benchmarks for
>> the attached patch?
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Cheers,
>> >> >> >> >> >> >> >> Thomas
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>
>> ---------------------------------------------------------------------
>> >> >> >> >> >> >> >> To unsubscribe, e-mail:
>> [email protected]
>> >> >> >> >> >> >> >> For additional commands, e-mail:
>> [email protected]
>> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> ---------------------------------------------------------------------
>> >> >> >> >> >> >> To unsubscribe, e-mail:
>> [email protected]
>> >> >> >> >> >> >> For additional commands, e-mail:
>> [email protected]
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >>
>> ---------------------------------------------------------------------
>> >> >> >> >> >> To unsubscribe, e-mail: [email protected]
>> >> >> >> >> >> For additional commands, e-mail:
>> [email protected]
>> >> >> >> >> >>
>>
>

diff --git a/lucene/core/src/test/org/apache/lucene/util/TestStringHelper.java b/lucene/core/src/test/org/apache/lucene/util/TestStringHelper.java
index b78c9a07bb9..eb3296d0064 100644
--- a/lucene/core/src/test/org/apache/lucene/util/TestStringHelper.java
+++ b/lucene/core/src/test/org/apache/lucene/util/TestStringHelper.java
@@ -77,6 +77,18 @@ public class TestStringHelper extends LuceneTestCase {
             newBytesRef(
                 "You want weapons? We're in a library! Books! The best weapons in the world!"),
             3476));
+
+    // There do not seem to exist good official MurmurHash3 test vectors, but the following vectors
+    // agree with the implementation shipped with the mmh3 python library. They specifically test
+    // the corner cases "4 bytes", "non-ascii-bytes", "non-ascii-bytes with nonzero initialization
+    // vector", and "input data that when interpreted as long is negative, with nonzero init".
+    assertEquals(0x5A97808A, StringHelper.murmurhash3_x86_32(newBytesRef("aaaa"), 0x9747b28c));
+    assertEquals(0xec72b6e8, StringHelper.murmurhash3_x86_32(newBytesRef("ππππππππ"), 0));
+    assertEquals(0xd58063c1, StringHelper.murmurhash3_x86_32(newBytesRef("ππππππππ"), 0x9747b28c));
+    assertEquals(0xb4f3eb2e, StringHelper.murmurhash3_x86_32(newBytesRef("\uFFFF\uFFFF\uFFFF\uFFFF"), 
+      0x9747b28c));
+    assertEquals(0x7309ab9b, StringHelper.murmurhash3_x86_32(newBytesRef(
+      "\uFFFF\uFFFF\uFFFF\uFFFF\uFFFF\uFFFF\uFFFF\uFFFF"), 0x9747b28c));
   }
 
   public void testSortKeyLength() throws Exception {

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Patch to change murmurhash implementation slightly

Reply via email to