Hey all, another data point: There's a diagram with the relevant distributions of word lengths in various languages here:
https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/ While English is close to the 8-byte limit, average word length in German is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes. I'll gather some averages over the various Wikipedia indices. Cheers, Thomas On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien <thomas.dull...@elastic.co> wrote: > Hey there, > > reviving this thread. To clarify: In order to show this patch is worth > doing, I should index a bunch of natural-language documents (whichever > language that is) and show that the patch brings a performance benefit? > > (Just clarifying, because at least inside ElasticSearch for the logs > use-case, it turns out that it does provide a performance benefit -- but I > want to make sure I understand what the Lucene community wishes to see as > "evidence" this is worth pursuing :-) > > Cheers, > Thomas > > On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood <wun...@wunderwood.org> > wrote: > >> I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic, >> Hebrew) will have longer byte strings because of UTF8. German has large >> compound words. >> >> wunder >> Walter Underwood >> wun...@wunderwood.org >> http://observer.wunderwood.org/ (my blog) >> >> On Apr 25, 2023, at 10:57 AM, Thomas Dullien >> <thomas.dull...@elastic.co.INVALID> wrote: >> >> Hey all, >> >> ok, attached is a second patch that adds some unit tests; I am happy to >> add more. >> >> This brings me back to my original question: I'd like to run some pretty >> thorough benchmarking on Lucene, both for this change and for possible >> other future changes, largely focused on indexing performance. What are >> good command lines to do so? What are good corpora? >> >> Cheers, >> Thomas >> >> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <thomas.dull...@elastic.co> >> wrote: >> >>> Hey, >>> >>> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish >>> official test vectors, see the following URLs: >>> https://github.com/aappleby/smhasher/issues/6 >>> >>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958 >>> There is a link to a pastebin entry in the first issue, which leads to >>> https://pastebin.com/kkggV9Vx >>> >>> Now, the test vectors in that pastebin do not match either the output of >>> pre-change Lucene's murmur3, nor the output of the Python mmh3 package. >>> That said, the pre-change Lucene and the mmh3 package agree, just not with >>> the published list. >>> >>> There *are* test vectors in the source code for the mmh3 python package, >>> which I could use, or cook up a set of bespoke ones, or both (I share the >>> concern about 8-byte boundaries and signedness). >>> >>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75 >>> >>> Cheers, >>> Thomas >>> >>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcm...@gmail.com> wrote: >>> >>>> i dont think we need a ton of random strings. But if you want to >>>> optimize for strings of length 8, at a minimum there should be very >>>> simple tests ensuring correctness for some boundary conditions (e.g. >>>> string of length exactly 8). i would also strongly recommend testing >>>> non-ascii since java is a language with signed integer types so it may >>>> be susceptible to bugs where the input bytes have the "sign bit" set. >>>> >>>> IMO this could be 2 simple unit tests. >>>> >>>> usually at least with these kinds of algorithms you can also find >>>> published "test vectors" that intend to seek out the corner cases. if >>>> these exist for murmurhash, we should fold them in too. >>>> >>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien >>>> <thomas.dull...@elastic.co> wrote: >>>> > >>>> > Hey, >>>> > >>>> > I offered to run a large number of random-string-hashes to ensure >>>> that the output is the same pre- and post-change. I can add an arbitrary >>>> number of such tests to TestStringHelper.java, just specify the number you >>>> wish. >>>> > >>>> > If your worry is that my change breaches the inlining bytecode limit: >>>> Did you check whether the old version was inlineable or not? The new >>>> version is 263 bytecode instructions, the old version was 110. The default >>>> inlining limit appears to be 35 bytecode instructions on cursory checking >>>> (I may be wrong on this, though), so I don't think it was ever inlineable >>>> in default configs. >>>> > >>>> > On your statement "we haven't seen performance gains" -- the starting >>>> point of this thread was a friendly request to please point me to >>>> instructions for running a broad range of Lucene indexing benchmarks, so I >>>> can gather data for further discussion; from my perspective, we haven't >>>> even gathered any data, so obviously we haven't seen any gains. >>>> > >>>> > Cheers, >>>> > Thomas >>>> > >>>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcm...@gmail.com> wrote: >>>> >> >>>> >> There is literally one string, all-ascii. This won't fail if all the >>>> >> shifts and masks are wrong. >>>> >> >>>> >> About the inlining, i'm not talking about cpu stuff, i'm talking >>>> about >>>> >> java. There are limits to the size of methods that get inlined (e.g. >>>> >> -XX:MaxInlineSize). If we make this method enormous like this, it may >>>> >> have performance consequences. >>>> >> >>>> >> We still haven't seen any performance gain from this. Elasticsearch >>>> >> putting huge unique IDs into indexed terms doesnt count. >>>> >> >>>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien >>>> >> <thomas.dull...@elastic.co> wrote: >>>> >> > >>>> >> > Hey, >>>> >> > >>>> >> > so there are unit tests in TestStringHelper.java that test strings >>>> of length greater than 8, and my change passes them. Could you explain what >>>> you want tested? >>>> >> > >>>> >> > Cheers, >>>> >> > Thomas >>>> >> > >>>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcm...@gmail.com> >>>> wrote: >>>> >> >> >>>> >> >> sure, but "if length > 8 return 1" might pass these same tests >>>> too, >>>> >> >> yet cause a ton of hash collisions. >>>> >> >> >>>> >> >> I just think if you want to optimize for super-long strings, there >>>> >> >> should be a unit test. >>>> >> >> >>>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien >>>> >> >> <thomas.dull...@elastic.co> wrote: >>>> >> >> > >>>> >> >> > Hey, >>>> >> >> > >>>> >> >> > I am pretty confident about correctness. The change passes both >>>> Lucene and ES regression tests and my careful reading of the code is pretty >>>> certain that the output is the same. If you want me to randomly test the >>>> result for a few hundred million random strings, I'm happy to do that, too, >>>> if you have other suggestions for correctness testing, let me know. >>>> >> >> > >>>> >> >> > The change does increase the method size and may impact >>>> inlining - but so does literally any code change, particularly in a JIT'ed >>>> environment where placement of code (and hence things like instruction >>>> cache conflicts) depend on the precise history of execution. The way I >>>> understand it, one deals with this by benchmarking and measuring. >>>> >> >> > >>>> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable >>>> improvement in indexing speed - this is why I was asking about a broad >>>> range of Lucene benchmarks; to verify that this is indeed the case for >>>> Lucene-only, too. >>>> >> >> > >>>> >> >> > Let me know what data you'd like to see to decide whether this >>>> patch is a good idea, and if there is consensus among the Lucene committers >>>> that those are reasonable criteria, I'll work on producing that data. >>>> >> >> > >>>> >> >> > Cheers, >>>> >> >> > Thomas >>>> >> >> > >>>> >> >> > >>>> >> >> > >>>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcm...@gmail.com> >>>> wrote: >>>> >> >> >> >>>> >> >> >> well there is some cost, as it must add additional checks to >>>> see if >>>> >> >> >> its longer than 8. in your patch, additional loops. it >>>> increases the >>>> >> >> >> method size and may impact inlining and other things. also we >>>> can't >>>> >> >> >> forget about correctness, if the hash function does the wrong >>>> thing it >>>> >> >> >> could slow everything to a crawl. >>>> >> >> >> >>>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien >>>> >> >> >> <thomas.dull...@elastic.co> wrote: >>>> >> >> >> > >>>> >> >> >> > Ah, I see what you mean. >>>> >> >> >> > >>>> >> >> >> > You are correct -- the change will not speed up a 5-byte >>>> word, but it *will* speed up all 8+-byte words, at no cost to the shorter >>>> words. >>>> >> >> >> > >>>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir < >>>> rcm...@gmail.com> wrote: >>>> >> >> >> >> >>>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time >>>> isn't going to >>>> >> >> >> >> speed anything up. there aren't 8 bytes to process. >>>> >> >> >> >> >>>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien >>>> >> >> >> >> <thomas.dull...@elastic.co.invalid> wrote: >>>> >> >> >> >> > >>>> >> >> >> >> > Is average word length <= 4 realistic though? I mean, >>>> even the english wiki corpus has ~5, which would require two calls to the >>>> lucene layer instead of one; e.g. multiple layers of virtual dispatch that >>>> are unnecessary? >>>> >> >> >> >> > >>>> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes >>>> instead of 4 bytes, so the cost of doing so will be the same - while >>>> speeding up in cases where 4 isn't quite enough? >>>> >> >> >> >> > >>>> >> >> >> >> > Cheers, >>>> >> >> >> >> > Thomas >>>> >> >> >> >> > >>>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir < >>>> rcm...@gmail.com> wrote: >>>> >> >> >> >> >> >>>> >> >> >> >> >> i think from my perspective it has nothing to do with >>>> cpus being >>>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length >>>> of terms in >>>> >> >> >> >> >> most languages being smaller than 8. for the languages >>>> with longer >>>> >> >> >> >> >> word length, its usually because of complex morphology >>>> that most users >>>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems >>>> optimal IMO. >>>> >> >> >> >> >> languages from nature don't care about your cpu. >>>> >> >> >> >> >> >>>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless >>>> >> >> >> >> >> <luc...@mikemccandless.com> wrote: >>>> >> >> >> >> >> > >>>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a >>>> single thread for indexing, and SerialMergeScheduler (using that single >>>> thread to also do single-threaded merging). It makes the indexing take >>>> forever lol but it produces "comparable" results. >>>> >> >> >> >> >> > >>>> >> >> >> >> >> > But ... this sounds like a great change anyway? Do we >>>> really need to gate it on benchmark results? Do we think there could be a >>>> downside e.g. slower indexing on (the dwindling) 32 bit CPUs? >>>> >> >> >> >> >> > >>>> >> >> >> >> >> > Mike McCandless >>>> >> >> >> >> >> > >>>> >> >> >> >> >> > http://blog.mikemccandless.com >>>> >> >> >> >> >> > >>>> >> >> >> >> >> > >>>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir < >>>> rcm...@gmail.com> wrote: >>>> >> >> >> >> >> >> >>>> >> >> >> >> >> >> I think the results of the benchmark will depend on >>>> the properties of >>>> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil) >>>> the average word >>>> >> >> >> >> >> >> length is around 5 bytes so this optimization may not >>>> do much. >>>> >> >> >> >> >> >> >>>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai < >>>> zhai7...@gmail.com> wrote: >>>> >> >> >> >> >> >> > >>>> >> >> >> >> >> >> > I did a quick run with your patch, but since I >>>> turned on the CMS as well as TieredMergePolicy I'm not sure how fair the >>>> comparison is. Here's the result: >>>> >> >> >> >> >> >> > Candidate: >>>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total >>>> 33332620 docs >>>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec) >>>> >> >> >> >> >> >> > Indexer: finished (961877 msec) >>>> >> >> >> >> >> >> > Baseline: >>>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total >>>> 33332620 docs >>>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec) >>>> >> >> >> >> >> >> > Indexer: finished (964528 msec) >>>> >> >> >> >> >> >> > >>>> >> >> >> >> >> >> > For more accurate comparison I guess it's better to >>>> use LogxxMergePolicy and turn off CMS? If you want to run it yourself you >>>> can find the lines I quoted from the log file. >>>> >> >> >> >> >> >> > >>>> >> >> >> >> >> >> > Patrick >>>> >> >> >> >> >> >> > >>>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien < >>>> thomas.dull...@elastic.co.invalid> wrote: >>>> >> >> >> >> >> >> >> >>>> >> >> >> >> >> >> >> Hey all, >>>> >> >> >> >> >> >> >> >>>> >> >> >> >> >> >> >> I've been experimenting with fixing some >>>> low-hanging performance fruit in the ElasticSearch codebase, and came >>>> across the fact that the MurmurHash implementation that is used by >>>> ByteRef.hashCode() is reading 4 bytes per loop iteration (which is likely >>>> an artifact from 32-bit architectures, which are ever-less-important). I >>>> made a small fix to change the implementation to read 8 bytes per loop >>>> iteration; I expected a very small impact (2-3% CPU or so over an indexing >>>> run in ElasticSearch), but got a pretty nontrivial throughput improvement >>>> over a few indexing benchmarks. >>>> >> >> >> >> >> >> >> >>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and >>>> succeeded in running the example from >>>> https://github.com/mikemccand/luceneutil - but I couldn't figure out >>>> how to run indexing benchmarks and how to interpret the results. >>>> >> >> >> >> >> >> >> >>>> >> >> >> >> >> >> >> Could someone help me in running the benchmarks >>>> for the attached patch? >>>> >> >> >> >> >> >> >> >>>> >> >> >> >> >> >> >> Cheers, >>>> >> >> >> >> >> >> >> Thomas >>>> >> >> >> >> >> >> >> >>>> >> >> >> >> >> >> >> >>>> --------------------------------------------------------------------- >>>> >> >> >> >> >> >> >> To unsubscribe, e-mail: >>>> dev-unsubscr...@lucene.apache.org >>>> >> >> >> >> >> >> >> For additional commands, e-mail: >>>> dev-h...@lucene.apache.org >>>> >> >> >> >> >> >> >>>> >> >> >> >> >> >> >>>> --------------------------------------------------------------------- >>>> >> >> >> >> >> >> To unsubscribe, e-mail: >>>> dev-unsubscr...@lucene.apache.org >>>> >> >> >> >> >> >> For additional commands, e-mail: >>>> dev-h...@lucene.apache.org >>>> >> >> >> >> >> >> >>>> >> >> >> >> >> >>>> >> >> >> >> >> >>>> --------------------------------------------------------------------- >>>> >> >> >> >> >> To unsubscribe, e-mail: >>>> dev-unsubscr...@lucene.apache.org >>>> >> >> >> >> >> For additional commands, e-mail: >>>> dev-h...@lucene.apache.org >>>> >> >> >> >> >> >>>> >>> <murmur-tests.patch> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> >>