Hey there,

reviving this thread. To clarify: In order to show this patch is worth
doing, I should index a bunch of natural-language documents (whichever
language that is) and show that the patch brings a performance benefit?

(Just clarifying, because at least inside ElasticSearch for the logs
use-case, it turns out that it does provide a performance benefit -- but I
want to make sure I understand what the Lucene community wishes to see as
"evidence" this is worth pursuing :-)

Cheers,
Thomas

On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood <wun...@wunderwood.org>
wrote:

> I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic,
> Hebrew) will have longer byte strings because of UTF8. German has large
> compound words.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Apr 25, 2023, at 10:57 AM, Thomas Dullien
> <thomas.dull...@elastic.co.INVALID> wrote:
>
> Hey all,
>
> ok, attached is a second patch that adds some unit tests; I am happy to
> add more.
>
> This brings me back to my original question: I'd like to run some pretty
> thorough benchmarking on Lucene, both for this change and for possible
> other future changes, largely focused on indexing performance. What are
> good command lines to do so? What are good corpora?
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <thomas.dull...@elastic.co>
> wrote:
>
>> Hey,
>>
>> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish
>> official test vectors, see the following URLs:
>> https://github.com/aappleby/smhasher/issues/6
>>
>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
>> There is a link to a pastebin entry in the first issue, which leads to
>> https://pastebin.com/kkggV9Vx
>>
>> Now, the test vectors in that pastebin do not match either the output of
>> pre-change Lucene's murmur3, nor the output of the Python mmh3 package.
>> That said, the pre-change Lucene and the mmh3 package agree, just not with
>> the published list.
>>
>> There *are* test vectors in the source code for the mmh3 python package,
>> which I could use, or cook up a set of bespoke ones, or both (I share the
>> concern about 8-byte boundaries and signedness).
>>
>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
>>
>> Cheers,
>> Thomas
>>
>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>>> i dont think we need a ton of random strings. But if you want to
>>> optimize for strings of length 8, at a minimum there should be very
>>> simple tests ensuring correctness for some boundary conditions (e.g.
>>> string of length exactly 8). i would also strongly recommend testing
>>> non-ascii since java is a language with signed integer types so it may
>>> be susceptible to bugs where the input bytes have the "sign bit" set.
>>>
>>> IMO this could be 2 simple unit tests.
>>>
>>> usually at least with these kinds of algorithms you can also find
>>> published "test vectors" that intend to seek out the corner cases. if
>>> these exist for murmurhash, we should fold them in too.
>>>
>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>>> <thomas.dull...@elastic.co> wrote:
>>> >
>>> > Hey,
>>> >
>>> > I offered to run a large number of random-string-hashes to ensure that
>>> the output is the same pre- and post-change. I can add an arbitrary number
>>> of such tests to TestStringHelper.java, just specify the number you wish.
>>> >
>>> > If your worry is that my change breaches the inlining bytecode limit:
>>> Did you check whether the old version was inlineable or not? The new
>>> version is 263 bytecode instructions, the old version was 110. The default
>>> inlining limit appears to be 35 bytecode instructions on cursory checking
>>> (I may be wrong on this, though), so I don't think it was ever inlineable
>>> in default configs.
>>> >
>>> > On your statement "we haven't seen performance gains" -- the starting
>>> point of this thread was a friendly request to please point me to
>>> instructions for running a broad range of Lucene indexing benchmarks, so I
>>> can gather data for further discussion; from my perspective, we haven't
>>> even gathered any data, so obviously we haven't seen any gains.
>>> >
>>> > Cheers,
>>> > Thomas
>>> >
>>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcm...@gmail.com> wrote:
>>> >>
>>> >> There is literally one string, all-ascii. This won't fail if all the
>>> >> shifts and masks are wrong.
>>> >>
>>> >> About the inlining, i'm not talking about cpu stuff, i'm talking about
>>> >> java. There are limits to the size of methods that get inlined (e.g.
>>> >> -XX:MaxInlineSize). If we make this method enormous like this, it may
>>> >> have performance consequences.
>>> >>
>>> >> We still haven't seen any performance gain from this. Elasticsearch
>>> >> putting huge unique IDs into indexed terms doesnt count.
>>> >>
>>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien
>>> >> <thomas.dull...@elastic.co> wrote:
>>> >> >
>>> >> > Hey,
>>> >> >
>>> >> > so there are unit tests in TestStringHelper.java that test strings
>>> of length greater than 8, and my change passes them. Could you explain what
>>> you want tested?
>>> >> >
>>> >> > Cheers,
>>> >> > Thomas
>>> >> >
>>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcm...@gmail.com>
>>> wrote:
>>> >> >>
>>> >> >> sure, but "if length > 8 return 1" might pass these same tests too,
>>> >> >> yet cause a ton of hash collisions.
>>> >> >>
>>> >> >> I just think if you want to optimize for super-long strings, there
>>> >> >> should be a unit test.
>>> >> >>
>>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien
>>> >> >> <thomas.dull...@elastic.co> wrote:
>>> >> >> >
>>> >> >> > Hey,
>>> >> >> >
>>> >> >> > I am pretty confident about correctness. The change passes both
>>> Lucene and ES regression tests and my careful reading of the code is pretty
>>> certain that the output is the same. If you want me to randomly test the
>>> result for a few hundred million random strings, I'm happy to do that, too,
>>> if you have other suggestions for correctness testing, let me know.
>>> >> >> >
>>> >> >> > The change does increase the method size and may impact inlining
>>> - but so does literally any code change, particularly in a JIT'ed
>>> environment where placement of code (and hence things like instruction
>>> cache conflicts) depend on the precise history of execution. The way I
>>> understand it, one deals with this by benchmarking and measuring.
>>> >> >> >
>>> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable
>>> improvement in indexing speed - this is why I was asking about a broad
>>> range of Lucene benchmarks; to verify that this is indeed the case for
>>> Lucene-only, too.
>>> >> >> >
>>> >> >> > Let me know what data you'd like to see to decide whether this
>>> patch is a good idea, and if there is consensus among the Lucene committers
>>> that those are reasonable criteria, I'll work on producing that data.
>>> >> >> >
>>> >> >> > Cheers,
>>> >> >> > Thomas
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcm...@gmail.com>
>>> wrote:
>>> >> >> >>
>>> >> >> >> well there is some cost, as it must add additional checks to
>>> see if
>>> >> >> >> its longer than 8. in your patch, additional loops. it
>>> increases the
>>> >> >> >> method size and may impact inlining and other things. also we
>>> can't
>>> >> >> >> forget about correctness, if the hash function does the wrong
>>> thing it
>>> >> >> >> could slow everything to a crawl.
>>> >> >> >>
>>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien
>>> >> >> >> <thomas.dull...@elastic.co> wrote:
>>> >> >> >> >
>>> >> >> >> > Ah, I see what you mean.
>>> >> >> >> >
>>> >> >> >> > You are correct -- the change will not speed up a 5-byte
>>> word, but it *will* speed up all 8+-byte words, at no cost to the shorter
>>> words.
>>> >> >> >> >
>>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcm...@gmail.com>
>>> wrote:
>>> >> >> >> >>
>>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't
>>> going to
>>> >> >> >> >> speed anything up. there aren't 8 bytes to process.
>>> >> >> >> >>
>>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
>>> >> >> >> >> <thomas.dull...@elastic.co.invalid> wrote:
>>> >> >> >> >> >
>>> >> >> >> >> > Is average word length <= 4 realistic though? I mean, even
>>> the english wiki corpus has ~5, which would require two calls to the lucene
>>> layer instead of one; e.g. multiple layers of virtual dispatch that are
>>> unnecessary?
>>> >> >> >> >> >
>>> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes
>>> instead of 4 bytes, so the cost of doing so will be the same - while
>>> speeding up in cases where 4 isn't quite enough?
>>> >> >> >> >> >
>>> >> >> >> >> > Cheers,
>>> >> >> >> >> > Thomas
>>> >> >> >> >> >
>>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <
>>> rcm...@gmail.com> wrote:
>>> >> >> >> >> >>
>>> >> >> >> >> >> i think from my perspective it has nothing to do with
>>> cpus being
>>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length
>>> of terms in
>>> >> >> >> >> >> most languages being smaller than 8. for the languages
>>> with longer
>>> >> >> >> >> >> word length, its usually because of complex morphology
>>> that most users
>>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal
>>> IMO.
>>> >> >> >> >> >> languages from nature don't care about your cpu.
>>> >> >> >> >> >>
>>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>>> >> >> >> >> >> <luc...@mikemccandless.com> wrote:
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a single
>>> thread for indexing, and SerialMergeScheduler (using that single thread to
>>> also do single-threaded merging).  It makes the indexing take forever lol
>>> but it produces "comparable" results.
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > But ... this sounds like a great change anyway?  Do we
>>> really need to gate it on benchmark results?  Do we think there could be a
>>> downside e.g. slower indexing on (the dwindling) 32 bit CPUs?
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > Mike McCandless
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > http://blog.mikemccandless.com
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <
>>> rcm...@gmail.com> wrote:
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> I think the results of the benchmark will depend on
>>> the properties of
>>> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil)
>>> the average word
>>> >> >> >> >> >> >> length is around 5 bytes so this optimization may not
>>> do much.
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <
>>> zhai7...@gmail.com> wrote:
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > I did a quick run with your patch, but since I
>>> turned on the CMS as well as TieredMergePolicy I'm not sure how fair the
>>> comparison is. Here's the result:
>>> >> >> >> >> >> >> > Candidate:
>>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620
>>> docs
>>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec)
>>> >> >> >> >> >> >> > Indexer: finished (961877 msec)
>>> >> >> >> >> >> >> > Baseline:
>>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620
>>> docs
>>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec)
>>> >> >> >> >> >> >> > Indexer: finished (964528 msec)
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > For more accurate comparison I guess it's better to
>>> use LogxxMergePolicy and turn off CMS? If you want to run it yourself you
>>> can find the lines I quoted from the log file.
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > Patrick
>>> >> >> >> >> >> >> >
>>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien <
>>> thomas.dull...@elastic.co.invalid> wrote:
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> Hey all,
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> I've been experimenting with fixing some
>>> low-hanging performance fruit in the ElasticSearch codebase, and came
>>> across the fact that the MurmurHash implementation that is used by
>>> ByteRef.hashCode() is reading 4 bytes per loop iteration (which is likely
>>> an artifact from 32-bit architectures, which are ever-less-important). I
>>> made a small fix to change the implementation to read 8 bytes per loop
>>> iteration; I expected a very small impact (2-3% CPU or so over an indexing
>>> run in ElasticSearch), but got a pretty nontrivial throughput improvement
>>> over a few indexing benchmarks.
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and
>>> succeeded in running the example from
>>> https://github.com/mikemccand/luceneutil - but I couldn't figure out
>>> how to run indexing benchmarks and how to interpret the results.
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> Could someone help me in running the benchmarks for
>>> the attached patch?
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >> Cheers,
>>> >> >> >> >> >> >> >> Thomas
>>> >> >> >> >> >> >> >>
>>> >> >> >> >> >> >> >>
>>> ---------------------------------------------------------------------
>>> >> >> >> >> >> >> >> To unsubscribe, e-mail:
>>> dev-unsubscr...@lucene.apache.org
>>> >> >> >> >> >> >> >> For additional commands, e-mail:
>>> dev-h...@lucene.apache.org
>>> >> >> >> >> >> >>
>>> >> >> >> >> >> >>
>>> ---------------------------------------------------------------------
>>> >> >> >> >> >> >> To unsubscribe, e-mail:
>>> dev-unsubscr...@lucene.apache.org
>>> >> >> >> >> >> >> For additional commands, e-mail:
>>> dev-h...@lucene.apache.org
>>> >> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> ---------------------------------------------------------------------
>>> >> >> >> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> >> >> >> >> >> For additional commands, e-mail:
>>> dev-h...@lucene.apache.org
>>> >> >> >> >> >>
>>>
>> <murmur-tests.patch>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>
>

Reply via email to