Hey Marcus, Thank you for your reply. I'll work on Lucenebench and report back, I'll also check if adding a benchmark for a long-utf-8-word language makes the effect more visible.
I will update this thread next week then :-) Have a good weekend y'all :-) Cheers, Thomas Marcus Eagan <marcusea...@gmail.com> schrieb am Sa., 26. Aug. 2023, 01:04: > Hi Thomas, > > Thank you for the hard work thus far. I'm excited to see if the community > can benefit from the work. The best way to use the lucene bench is to run > the baseline and candidate branches as described here > <https://github.com/mikemccand/luceneutil#preparing-the-benchmark-candidates> > . > > I can help you with it and even submit an update to the benchmark repo as > needed if we find that we can improve some of the steps there to make it > easier for onlookers. Have you already tried setting up lucene_util? > > - Marcus > > > > On Fri, Aug 25, 2023 at 6:34 PM Thomas Dullien > <thomas.dull...@elastic.co.invalid> wrote: > >> Hey all, >> >> apologies if the chart is incorrect. Anyhow, I think the more important >> questions are: >> >> 1) Which benchmarks does the Lucene community have that y'all would like >> to see an improvement on before accepting this (or any other future) >> performance patches? >> >> I'm guessing the reason why the patch improves http_log performance is >> because that benchmark indexes many IP addresses, which tend to be 9-15 >> bytes in length. That does not strike me as an atypical workload. >> >> I've also done some quick experiments to estimate the average UTF-8 word >> size of languages in non-ASCII scripts (for example Hindi), and it seems to >> me the average word size is 2-3 times larger than english because most >> indic characters will encode to 2-3 bytes. The following excerpt from Hindi >> Wikipedia is 2242 bytes, but just 146 words >> ==== begin hindi-example.txt ==== >> १० वीं शताब्दी के बाद घुमन्तु मुस्लिम वंशों ने जातियता तथा धर्म द्वारा >> संघठित तेज घोड़ों से युक्त बड़ी सेना के द्वारा उत्तर-पश्चिमी मैदानों पर बार >> बार आकर्मण किया, अंततः १२०६ [[दिल्ली सल्तनत|इस्लामीक दिल्ली सल्तनत]] की >> स्थापना हुई।{{sfn|Ludden|2002|p = 68}} उन्हें उतर भारत को >> धिक नियंत्रित करना था तथा दक्षिण भारत पर आकर्मण करना था। भारतीय कुलीन >> वर्ग के विघटनकारी सल्तनत ने बड़े पैमाने पर गैर-मुस्लिमों को स्वयं के >> रीतिरिवाजों पर छोड़ दिया।{{sfn|Asher|Talbot|2008|p = >> 47}}{{sfn|Metcalf|Metcalf|2006|p = 6}} १३ वीं शताब्दी में [[मंगोल साम्राज् >> य|मंगोलों]] द्वारा किये के विनाशकारी आकर्मण से भारत की रक्षा की। सल्तनत >> के पतन के कारण स्वशासित [[विजयनगर साम्राज्य|विजयनगर साम्राज्य]] का मार्ग >> प्रशस्त हुआ।{{sfn|Asher|Talbot|2008|p = 53}} एक मजबूत [[शैव|शैव परंपरा]] और >> सल्तनत की सैन्य तकनीक पर निर्माण करते हुए साम्रा >> ज्य ने भारत के विशाल भाग पर शासन किया और इसके बाद लंबे समय तक दक्षिण >> भारतीय समाज को प्रभावित किया।{{sfn|Metcalf|Metcalf|2006|p = >> 12}}{{sfn|Asher|Talbot|2008|p = 53}} >> ==== end hindi-example.txt ==== >> >> cat hindi-example.txt | wc -w >> 146 >> 2242 divided by 146 yields a word length of ~15 bytes, so I'd be >> surprised if average word length of Hindi wikipedia was below 12 bytes. >> >> Do y'all wish for me to create another benchmark for indexing indic >> scripts and large corpora of IPv4 and IPv6 addresses (both things that seem >> not very niche), and if the patch shows improvement there, will y'all >> accept it? >> >> 2) It'd be great if there was some public documentation about "what we >> want to see from a performance improvement before we accept it". I >> personally find the discussion around this patch to be bewilderingly long, >> and I am happy to help work on such a guideline with others -- IMO having a >> clear set of hurdles is preferable to the back-and-forth we've had so far? >> >> Cheers, >> Thomas >> >> >> >> >> >> On Fri, Aug 25, 2023 at 3:38 PM Robert Muir <rcm...@gmail.com> wrote: >> >>> chart is wrong, average word length for english is like 5. >>> >>> On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien >>> <thomas.dull...@elastic.co.invalid> wrote: >>> > >>> > Hey all, >>> > >>> > another data point: There's a diagram with the relevant distributions >>> of word lengths in various languages here: >>> > >>> > >>> https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/ >>> > >>> > While English is close to the 8-byte limit, average word length in >>> German is 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes. >>> I'll gather some averages over the various Wikipedia indices. >>> > >>> > Cheers, >>> > Thomas >>> > >>> > On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien < >>> thomas.dull...@elastic.co> wrote: >>> >> >>> >> Hey there, >>> >> >>> >> reviving this thread. To clarify: In order to show this patch is >>> worth doing, I should index a bunch of natural-language documents >>> (whichever language that is) and show that the patch brings a performance >>> benefit? >>> >> >>> >> (Just clarifying, because at least inside ElasticSearch for the logs >>> use-case, it turns out that it does provide a performance benefit -- but I >>> want to make sure I understand what the Lucene community wishes to see as >>> "evidence" this is worth pursuing :-) >>> >> >>> >> Cheers, >>> >> Thomas >>> >> >>> >> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood < >>> wun...@wunderwood.org> wrote: >>> >>> >>> >>> I would recommend some non-English tests. Non-Latin scripts (CJK, >>> Arabic, Hebrew) will have longer byte strings because of UTF8. German has >>> large compound words. >>> >>> >>> >>> wunder >>> >>> Walter Underwood >>> >>> wun...@wunderwood.org >>> >>> http://observer.wunderwood.org/ (my blog) >>> >>> >>> >>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien < >>> thomas.dull...@elastic.co.INVALID> wrote: >>> >>> >>> >>> Hey all, >>> >>> >>> >>> ok, attached is a second patch that adds some unit tests; I am happy >>> to add more. >>> >>> >>> >>> This brings me back to my original question: I'd like to run some >>> pretty thorough benchmarking on Lucene, both for this change and for >>> possible other future changes, largely focused on indexing performance. >>> What are good command lines to do so? What are good corpora? >>> >>> >>> >>> Cheers, >>> >>> Thomas >>> >>> >>> >>> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien < >>> thomas.dull...@elastic.co> wrote: >>> >>>> >>> >>>> Hey, >>> >>>> >>> >>>> ok, I've done some digging: Unfortunately, MurmurHash3 does not >>> publish official test vectors, see the following URLs: >>> >>>> https://github.com/aappleby/smhasher/issues/6 >>> >>>> >>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958 >>> >>>> There is a link to a pastebin entry in the first issue, which leads >>> to https://pastebin.com/kkggV9Vx >>> >>>> >>> >>>> Now, the test vectors in that pastebin do not match either the >>> output of pre-change Lucene's murmur3, nor the output of the Python mmh3 >>> package. That said, the pre-change Lucene and the mmh3 package agree, just >>> not with the published list. >>> >>>> >>> >>>> There *are* test vectors in the source code for the mmh3 python >>> package, which I could use, or cook up a set of bespoke ones, or both (I >>> share the concern about 8-byte boundaries and signedness). >>> >>>> >>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75 >>> >>>> >>> >>>> Cheers, >>> >>>> Thomas >>> >>>> >>> >>>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcm...@gmail.com> >>> wrote: >>> >>>>> >>> >>>>> i dont think we need a ton of random strings. But if you want to >>> >>>>> optimize for strings of length 8, at a minimum there should be very >>> >>>>> simple tests ensuring correctness for some boundary conditions >>> (e.g. >>> >>>>> string of length exactly 8). i would also strongly recommend >>> testing >>> >>>>> non-ascii since java is a language with signed integer types so it >>> may >>> >>>>> be susceptible to bugs where the input bytes have the "sign bit" >>> set. >>> >>>>> >>> >>>>> IMO this could be 2 simple unit tests. >>> >>>>> >>> >>>>> usually at least with these kinds of algorithms you can also find >>> >>>>> published "test vectors" that intend to seek out the corner cases. >>> if >>> >>>>> these exist for murmurhash, we should fold them in too. >>> >>>>> >>> >>>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien >>> >>>>> <thomas.dull...@elastic.co> wrote: >>> >>>>> > >>> >>>>> > Hey, >>> >>>>> > >>> >>>>> > I offered to run a large number of random-string-hashes to >>> ensure that the output is the same pre- and post-change. I can add an >>> arbitrary number of such tests to TestStringHelper.java, just specify the >>> number you wish. >>> >>>>> > >>> >>>>> > If your worry is that my change breaches the inlining bytecode >>> limit: Did you check whether the old version was inlineable or not? The new >>> version is 263 bytecode instructions, the old version was 110. The default >>> inlining limit appears to be 35 bytecode instructions on cursory checking >>> (I may be wrong on this, though), so I don't think it was ever inlineable >>> in default configs. >>> >>>>> > >>> >>>>> > On your statement "we haven't seen performance gains" -- the >>> starting point of this thread was a friendly request to please point me to >>> instructions for running a broad range of Lucene indexing benchmarks, so I >>> can gather data for further discussion; from my perspective, we haven't >>> even gathered any data, so obviously we haven't seen any gains. >>> >>>>> > >>> >>>>> > Cheers, >>> >>>>> > Thomas >>> >>>>> > >>> >>>>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcm...@gmail.com> >>> wrote: >>> >>>>> >> >>> >>>>> >> There is literally one string, all-ascii. This won't fail if >>> all the >>> >>>>> >> shifts and masks are wrong. >>> >>>>> >> >>> >>>>> >> About the inlining, i'm not talking about cpu stuff, i'm >>> talking about >>> >>>>> >> java. There are limits to the size of methods that get inlined >>> (e.g. >>> >>>>> >> -XX:MaxInlineSize). If we make this method enormous like this, >>> it may >>> >>>>> >> have performance consequences. >>> >>>>> >> >>> >>>>> >> We still haven't seen any performance gain from this. >>> Elasticsearch >>> >>>>> >> putting huge unique IDs into indexed terms doesnt count. >>> >>>>> >> >>> >>>>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien >>> >>>>> >> <thomas.dull...@elastic.co> wrote: >>> >>>>> >> > >>> >>>>> >> > Hey, >>> >>>>> >> > >>> >>>>> >> > so there are unit tests in TestStringHelper.java that test >>> strings of length greater than 8, and my change passes them. Could you >>> explain what you want tested? >>> >>>>> >> > >>> >>>>> >> > Cheers, >>> >>>>> >> > Thomas >>> >>>>> >> > >>> >>>>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcm...@gmail.com> >>> wrote: >>> >>>>> >> >> >>> >>>>> >> >> sure, but "if length > 8 return 1" might pass these same >>> tests too, >>> >>>>> >> >> yet cause a ton of hash collisions. >>> >>>>> >> >> >>> >>>>> >> >> I just think if you want to optimize for super-long strings, >>> there >>> >>>>> >> >> should be a unit test. >>> >>>>> >> >> >>> >>>>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien >>> >>>>> >> >> <thomas.dull...@elastic.co> wrote: >>> >>>>> >> >> > >>> >>>>> >> >> > Hey, >>> >>>>> >> >> > >>> >>>>> >> >> > I am pretty confident about correctness. The change passes >>> both Lucene and ES regression tests and my careful reading of the code is >>> pretty certain that the output is the same. If you want me to randomly test >>> the result for a few hundred million random strings, I'm happy to do that, >>> too, if you have other suggestions for correctness testing, let me know. >>> >>>>> >> >> > >>> >>>>> >> >> > The change does increase the method size and may impact >>> inlining - but so does literally any code change, particularly in a JIT'ed >>> environment where placement of code (and hence things like instruction >>> cache conflicts) depend on the precise history of execution. The way I >>> understand it, one deals with this by benchmarking and measuring. >>> >>>>> >> >> > >>> >>>>> >> >> > FWIW, several indexing-heavy ES benchmarks show a >>> noticeable improvement in indexing speed - this is why I was asking about a >>> broad range of Lucene benchmarks; to verify that this is indeed the case >>> for Lucene-only, too. >>> >>>>> >> >> > >>> >>>>> >> >> > Let me know what data you'd like to see to decide whether >>> this patch is a good idea, and if there is consensus among the Lucene >>> committers that those are reasonable criteria, I'll work on producing that >>> data. >>> >>>>> >> >> > >>> >>>>> >> >> > Cheers, >>> >>>>> >> >> > Thomas >>> >>>>> >> >> > >>> >>>>> >> >> > >>> >>>>> >> >> > >>> >>>>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir < >>> rcm...@gmail.com> wrote: >>> >>>>> >> >> >> >>> >>>>> >> >> >> well there is some cost, as it must add additional checks >>> to see if >>> >>>>> >> >> >> its longer than 8. in your patch, additional loops. it >>> increases the >>> >>>>> >> >> >> method size and may impact inlining and other things. >>> also we can't >>> >>>>> >> >> >> forget about correctness, if the hash function does the >>> wrong thing it >>> >>>>> >> >> >> could slow everything to a crawl. >>> >>>>> >> >> >> >>> >>>>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien >>> >>>>> >> >> >> <thomas.dull...@elastic.co> wrote: >>> >>>>> >> >> >> > >>> >>>>> >> >> >> > Ah, I see what you mean. >>> >>>>> >> >> >> > >>> >>>>> >> >> >> > You are correct -- the change will not speed up a >>> 5-byte word, but it *will* speed up all 8+-byte words, at no cost to the >>> shorter words. >>> >>>>> >> >> >> > >>> >>>>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir < >>> rcm...@gmail.com> wrote: >>> >>>>> >> >> >> >> >>> >>>>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time >>> isn't going to >>> >>>>> >> >> >> >> speed anything up. there aren't 8 bytes to process. >>> >>>>> >> >> >> >> >>> >>>>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien >>> >>>>> >> >> >> >> <thomas.dull...@elastic.co.invalid> wrote: >>> >>>>> >> >> >> >> > >>> >>>>> >> >> >> >> > Is average word length <= 4 realistic though? I >>> mean, even the english wiki corpus has ~5, which would require two calls to >>> the lucene layer instead of one; e.g. multiple layers of virtual dispatch >>> that are unnecessary? >>> >>>>> >> >> >> >> > >>> >>>>> >> >> >> >> > You're not going to pay any cycles for reading 8 >>> bytes instead of 4 bytes, so the cost of doing so will be the same - while >>> speeding up in cases where 4 isn't quite enough? >>> >>>>> >> >> >> >> > >>> >>>>> >> >> >> >> > Cheers, >>> >>>>> >> >> >> >> > Thomas >>> >>>>> >> >> >> >> > >>> >>>>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir < >>> rcm...@gmail.com> wrote: >>> >>>>> >> >> >> >> >> >>> >>>>> >> >> >> >> >> i think from my perspective it has nothing to do >>> with cpus being >>> >>>>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average >>> length of terms in >>> >>>>> >> >> >> >> >> most languages being smaller than 8. for the >>> languages with longer >>> >>>>> >> >> >> >> >> word length, its usually because of complex >>> morphology that most users >>> >>>>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems >>> optimal IMO. >>> >>>>> >> >> >> >> >> languages from nature don't care about your cpu. >>> >>>>> >> >> >> >> >> >>> >>>>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless >>> >>>>> >> >> >> >> >> <luc...@mikemccandless.com> wrote: >>> >>>>> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a >>> single thread for indexing, and SerialMergeScheduler (using that single >>> thread to also do single-threaded merging). It makes the indexing take >>> forever lol but it produces "comparable" results. >>> >>>>> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> > But ... this sounds like a great change anyway? >>> Do we really need to gate it on benchmark results? Do we think there could >>> be a downside e.g. slower indexing on (the dwindling) 32 bit CPUs? >>> >>>>> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> > Mike McCandless >>> >>>>> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> > http://blog.mikemccandless.com >>> >>>>> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir < >>> rcm...@gmail.com> wrote: >>> >>>>> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >> I think the results of the benchmark will depend >>> on the properties of >>> >>>>> >> >> >> >> >> >> the indexed terms. For english wikipedia >>> (luceneutil) the average word >>> >>>>> >> >> >> >> >> >> length is around 5 bytes so this optimization >>> may not do much. >>> >>>>> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai < >>> zhai7...@gmail.com> wrote: >>> >>>>> >> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> >> > I did a quick run with your patch, but since I >>> turned on the CMS as well as TieredMergePolicy I'm not sure how fair the >>> comparison is. Here's the result: >>> >>>>> >> >> >> >> >> >> > Candidate: >>> >>>>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total >>> 33332620 docs >>> >>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec) >>> >>>>> >> >> >> >> >> >> > Indexer: finished (961877 msec) >>> >>>>> >> >> >> >> >> >> > Baseline: >>> >>>>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total >>> 33332620 docs >>> >>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec) >>> >>>>> >> >> >> >> >> >> > Indexer: finished (964528 msec) >>> >>>>> >> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> >> > For more accurate comparison I guess it's >>> better to use LogxxMergePolicy and turn off CMS? If you want to run it >>> yourself you can find the lines I quoted from the log file. >>> >>>>> >> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> >> > Patrick >>> >>>>> >> >> >> >> >> >> > >>> >>>>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas >>> Dullien <thomas.dull...@elastic.co.invalid> wrote: >>> >>>>> >> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >> >> Hey all, >>> >>>>> >> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >> >> I've been experimenting with fixing some >>> low-hanging performance fruit in the ElasticSearch codebase, and came >>> across the fact that the MurmurHash implementation that is used by >>> ByteRef.hashCode() is reading 4 bytes per loop iteration (which is likely >>> an artifact from 32-bit architectures, which are ever-less-important). I >>> made a small fix to change the implementation to read 8 bytes per loop >>> iteration; I expected a very small impact (2-3% CPU or so over an indexing >>> run in ElasticSearch), but got a pretty nontrivial throughput improvement >>> over a few indexing benchmarks. >>> >>>>> >> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and >>> succeeded in running the example from >>> https://github.com/mikemccand/luceneutil - but I couldn't figure out >>> how to run indexing benchmarks and how to interpret the results. >>> >>>>> >> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >> >> Could someone help me in running the >>> benchmarks for the attached patch? >>> >>>>> >> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >> >> Cheers, >>> >>>>> >> >> >> >> >> >> >> Thomas >>> >>>>> >> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >> >> >>> --------------------------------------------------------------------- >>> >>>>> >> >> >> >> >> >> >> To unsubscribe, e-mail: >>> dev-unsubscr...@lucene.apache.org >>> >>>>> >> >> >> >> >> >> >> For additional commands, e-mail: >>> dev-h...@lucene.apache.org >>> >>>>> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >> >>> --------------------------------------------------------------------- >>> >>>>> >> >> >> >> >> >> To unsubscribe, e-mail: >>> dev-unsubscr...@lucene.apache.org >>> >>>>> >> >> >> >> >> >> For additional commands, e-mail: >>> dev-h...@lucene.apache.org >>> >>>>> >> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >>> >>>>> >> >> >> >> >> >>> --------------------------------------------------------------------- >>> >>>>> >> >> >> >> >> To unsubscribe, e-mail: >>> dev-unsubscr...@lucene.apache.org >>> >>>>> >> >> >> >> >> For additional commands, e-mail: >>> dev-h...@lucene.apache.org >>> >>>>> >> >> >> >> >> >>> >>> >>> >>> <murmur-tests.patch> >>> >>> --------------------------------------------------------------------- >>> >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>> >>> For additional commands, e-mail: dev-h...@lucene.apache.org >>> >>> >>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: dev-h...@lucene.apache.org >>> >>> > > -- > Marcus Eagan > >