chart is wrong, average word length for english is like 5. On Fri, Aug 25, 2023 at 9:35 AM Thomas Dullien <thomas.dull...@elastic.co.invalid> wrote: > > Hey all, > > another data point: There's a diagram with the relevant distributions of word > lengths in various languages here: > > https://www.reddit.com/r/languagelearning/comments/h9eao2/average_word_length_of_languages_in_europe_except/ > > While English is close to the 8-byte limit, average word length in German is > 11+ bytes, and Mongolian and Finnish will likewise be 11+ bytes. I'll gather > some averages over the various Wikipedia indices. > > Cheers, > Thomas > > On Thu, Aug 24, 2023 at 2:09 PM Thomas Dullien <thomas.dull...@elastic.co> > wrote: >> >> Hey there, >> >> reviving this thread. To clarify: In order to show this patch is worth >> doing, I should index a bunch of natural-language documents (whichever >> language that is) and show that the patch brings a performance benefit? >> >> (Just clarifying, because at least inside ElasticSearch for the logs >> use-case, it turns out that it does provide a performance benefit -- but I >> want to make sure I understand what the Lucene community wishes to see as >> "evidence" this is worth pursuing :-) >> >> Cheers, >> Thomas >> >> On Tue, Apr 25, 2023 at 8:14 PM Walter Underwood <wun...@wunderwood.org> >> wrote: >>> >>> I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic, >>> Hebrew) will have longer byte strings because of UTF8. German has large >>> compound words. >>> >>> wunder >>> Walter Underwood >>> wun...@wunderwood.org >>> http://observer.wunderwood.org/ (my blog) >>> >>> On Apr 25, 2023, at 10:57 AM, Thomas Dullien >>> <thomas.dull...@elastic.co.INVALID> wrote: >>> >>> Hey all, >>> >>> ok, attached is a second patch that adds some unit tests; I am happy to add >>> more. >>> >>> This brings me back to my original question: I'd like to run some pretty >>> thorough benchmarking on Lucene, both for this change and for possible >>> other future changes, largely focused on indexing performance. What are >>> good command lines to do so? What are good corpora? >>> >>> Cheers, >>> Thomas >>> >>> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien <thomas.dull...@elastic.co> >>> wrote: >>>> >>>> Hey, >>>> >>>> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish >>>> official test vectors, see the following URLs: >>>> https://github.com/aappleby/smhasher/issues/6 >>>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958 >>>> There is a link to a pastebin entry in the first issue, which leads to >>>> https://pastebin.com/kkggV9Vx >>>> >>>> Now, the test vectors in that pastebin do not match either the output of >>>> pre-change Lucene's murmur3, nor the output of the Python mmh3 package. >>>> That said, the pre-change Lucene and the mmh3 package agree, just not with >>>> the published list. >>>> >>>> There *are* test vectors in the source code for the mmh3 python package, >>>> which I could use, or cook up a set of bespoke ones, or both (I share the >>>> concern about 8-byte boundaries and signedness). >>>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75 >>>> >>>> Cheers, >>>> Thomas >>>> >>>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir <rcm...@gmail.com> wrote: >>>>> >>>>> i dont think we need a ton of random strings. But if you want to >>>>> optimize for strings of length 8, at a minimum there should be very >>>>> simple tests ensuring correctness for some boundary conditions (e.g. >>>>> string of length exactly 8). i would also strongly recommend testing >>>>> non-ascii since java is a language with signed integer types so it may >>>>> be susceptible to bugs where the input bytes have the "sign bit" set. >>>>> >>>>> IMO this could be 2 simple unit tests. >>>>> >>>>> usually at least with these kinds of algorithms you can also find >>>>> published "test vectors" that intend to seek out the corner cases. if >>>>> these exist for murmurhash, we should fold them in too. >>>>> >>>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien >>>>> <thomas.dull...@elastic.co> wrote: >>>>> > >>>>> > Hey, >>>>> > >>>>> > I offered to run a large number of random-string-hashes to ensure that >>>>> > the output is the same pre- and post-change. I can add an arbitrary >>>>> > number of such tests to TestStringHelper.java, just specify the number >>>>> > you wish. >>>>> > >>>>> > If your worry is that my change breaches the inlining bytecode limit: >>>>> > Did you check whether the old version was inlineable or not? The new >>>>> > version is 263 bytecode instructions, the old version was 110. The >>>>> > default inlining limit appears to be 35 bytecode instructions on >>>>> > cursory checking (I may be wrong on this, though), so I don't think it >>>>> > was ever inlineable in default configs. >>>>> > >>>>> > On your statement "we haven't seen performance gains" -- the starting >>>>> > point of this thread was a friendly request to please point me to >>>>> > instructions for running a broad range of Lucene indexing benchmarks, >>>>> > so I can gather data for further discussion; from my perspective, we >>>>> > haven't even gathered any data, so obviously we haven't seen any gains. >>>>> > >>>>> > Cheers, >>>>> > Thomas >>>>> > >>>>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir <rcm...@gmail.com> wrote: >>>>> >> >>>>> >> There is literally one string, all-ascii. This won't fail if all the >>>>> >> shifts and masks are wrong. >>>>> >> >>>>> >> About the inlining, i'm not talking about cpu stuff, i'm talking about >>>>> >> java. There are limits to the size of methods that get inlined (e.g. >>>>> >> -XX:MaxInlineSize). If we make this method enormous like this, it may >>>>> >> have performance consequences. >>>>> >> >>>>> >> We still haven't seen any performance gain from this. Elasticsearch >>>>> >> putting huge unique IDs into indexed terms doesnt count. >>>>> >> >>>>> >> On Tue, Apr 25, 2023 at 10:25 AM Thomas Dullien >>>>> >> <thomas.dull...@elastic.co> wrote: >>>>> >> > >>>>> >> > Hey, >>>>> >> > >>>>> >> > so there are unit tests in TestStringHelper.java that test strings >>>>> >> > of length greater than 8, and my change passes them. Could you >>>>> >> > explain what you want tested? >>>>> >> > >>>>> >> > Cheers, >>>>> >> > Thomas >>>>> >> > >>>>> >> > On Tue, Apr 25, 2023 at 4:21 PM Robert Muir <rcm...@gmail.com> wrote: >>>>> >> >> >>>>> >> >> sure, but "if length > 8 return 1" might pass these same tests too, >>>>> >> >> yet cause a ton of hash collisions. >>>>> >> >> >>>>> >> >> I just think if you want to optimize for super-long strings, there >>>>> >> >> should be a unit test. >>>>> >> >> >>>>> >> >> On Tue, Apr 25, 2023 at 10:20 AM Thomas Dullien >>>>> >> >> <thomas.dull...@elastic.co> wrote: >>>>> >> >> > >>>>> >> >> > Hey, >>>>> >> >> > >>>>> >> >> > I am pretty confident about correctness. The change passes both >>>>> >> >> > Lucene and ES regression tests and my careful reading of the code >>>>> >> >> > is pretty certain that the output is the same. If you want me to >>>>> >> >> > randomly test the result for a few hundred million random >>>>> >> >> > strings, I'm happy to do that, too, if you have other suggestions >>>>> >> >> > for correctness testing, let me know. >>>>> >> >> > >>>>> >> >> > The change does increase the method size and may impact inlining >>>>> >> >> > - but so does literally any code change, particularly in a JIT'ed >>>>> >> >> > environment where placement of code (and hence things like >>>>> >> >> > instruction cache conflicts) depend on the precise history of >>>>> >> >> > execution. The way I understand it, one deals with this by >>>>> >> >> > benchmarking and measuring. >>>>> >> >> > >>>>> >> >> > FWIW, several indexing-heavy ES benchmarks show a noticeable >>>>> >> >> > improvement in indexing speed - this is why I was asking about a >>>>> >> >> > broad range of Lucene benchmarks; to verify that this is indeed >>>>> >> >> > the case for Lucene-only, too. >>>>> >> >> > >>>>> >> >> > Let me know what data you'd like to see to decide whether this >>>>> >> >> > patch is a good idea, and if there is consensus among the Lucene >>>>> >> >> > committers that those are reasonable criteria, I'll work on >>>>> >> >> > producing that data. >>>>> >> >> > >>>>> >> >> > Cheers, >>>>> >> >> > Thomas >>>>> >> >> > >>>>> >> >> > >>>>> >> >> > >>>>> >> >> > On Tue, Apr 25, 2023 at 4:02 PM Robert Muir <rcm...@gmail.com> >>>>> >> >> > wrote: >>>>> >> >> >> >>>>> >> >> >> well there is some cost, as it must add additional checks to see >>>>> >> >> >> if >>>>> >> >> >> its longer than 8. in your patch, additional loops. it increases >>>>> >> >> >> the >>>>> >> >> >> method size and may impact inlining and other things. also we >>>>> >> >> >> can't >>>>> >> >> >> forget about correctness, if the hash function does the wrong >>>>> >> >> >> thing it >>>>> >> >> >> could slow everything to a crawl. >>>>> >> >> >> >>>>> >> >> >> On Tue, Apr 25, 2023 at 9:56 AM Thomas Dullien >>>>> >> >> >> <thomas.dull...@elastic.co> wrote: >>>>> >> >> >> > >>>>> >> >> >> > Ah, I see what you mean. >>>>> >> >> >> > >>>>> >> >> >> > You are correct -- the change will not speed up a 5-byte word, >>>>> >> >> >> > but it *will* speed up all 8+-byte words, at no cost to the >>>>> >> >> >> > shorter words. >>>>> >> >> >> > >>>>> >> >> >> > On Tue, Apr 25, 2023 at 3:20 PM Robert Muir <rcm...@gmail.com> >>>>> >> >> >> > wrote: >>>>> >> >> >> >> >>>>> >> >> >> >> if a word is of length 5, processing 8 bytes at a time isn't >>>>> >> >> >> >> going to >>>>> >> >> >> >> speed anything up. there aren't 8 bytes to process. >>>>> >> >> >> >> >>>>> >> >> >> >> On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien >>>>> >> >> >> >> <thomas.dull...@elastic.co.invalid> wrote: >>>>> >> >> >> >> > >>>>> >> >> >> >> > Is average word length <= 4 realistic though? I mean, even >>>>> >> >> >> >> > the english wiki corpus has ~5, which would require two >>>>> >> >> >> >> > calls to the lucene layer instead of one; e.g. multiple >>>>> >> >> >> >> > layers of virtual dispatch that are unnecessary? >>>>> >> >> >> >> > >>>>> >> >> >> >> > You're not going to pay any cycles for reading 8 bytes >>>>> >> >> >> >> > instead of 4 bytes, so the cost of doing so will be the >>>>> >> >> >> >> > same - while speeding up in cases where 4 isn't quite >>>>> >> >> >> >> > enough? >>>>> >> >> >> >> > >>>>> >> >> >> >> > Cheers, >>>>> >> >> >> >> > Thomas >>>>> >> >> >> >> > >>>>> >> >> >> >> > On Tue, Apr 25, 2023 at 3:07 PM Robert Muir >>>>> >> >> >> >> > <rcm...@gmail.com> wrote: >>>>> >> >> >> >> >> >>>>> >> >> >> >> >> i think from my perspective it has nothing to do with cpus >>>>> >> >> >> >> >> being >>>>> >> >> >> >> >> 32-bit or 64-bit and more to do with the average length of >>>>> >> >> >> >> >> terms in >>>>> >> >> >> >> >> most languages being smaller than 8. for the languages >>>>> >> >> >> >> >> with longer >>>>> >> >> >> >> >> word length, its usually because of complex morphology >>>>> >> >> >> >> >> that most users >>>>> >> >> >> >> >> would stem away. so doing 4 bytes at a time seems optimal >>>>> >> >> >> >> >> IMO. >>>>> >> >> >> >> >> languages from nature don't care about your cpu. >>>>> >> >> >> >> >> >>>>> >> >> >> >> >> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless >>>>> >> >> >> >> >> <luc...@mikemccandless.com> wrote: >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> > For a truly "pure" indexing test I usually use a single >>>>> >> >> >> >> >> > thread for indexing, and SerialMergeScheduler (using >>>>> >> >> >> >> >> > that single thread to also do single-threaded merging). >>>>> >> >> >> >> >> > It makes the indexing take forever lol but it produces >>>>> >> >> >> >> >> > "comparable" results. >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> > But ... this sounds like a great change anyway? Do we >>>>> >> >> >> >> >> > really need to gate it on benchmark results? Do we >>>>> >> >> >> >> >> > think there could be a downside e.g. slower indexing on >>>>> >> >> >> >> >> > (the dwindling) 32 bit CPUs? >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> > Mike McCandless >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> > http://blog.mikemccandless.com >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> > >>>>> >> >> >> >> >> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir >>>>> >> >> >> >> >> > <rcm...@gmail.com> wrote: >>>>> >> >> >> >> >> >> >>>>> >> >> >> >> >> >> I think the results of the benchmark will depend on the >>>>> >> >> >> >> >> >> properties of >>>>> >> >> >> >> >> >> the indexed terms. For english wikipedia (luceneutil) >>>>> >> >> >> >> >> >> the average word >>>>> >> >> >> >> >> >> length is around 5 bytes so this optimization may not >>>>> >> >> >> >> >> >> do much. >>>>> >> >> >> >> >> >> >>>>> >> >> >> >> >> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai >>>>> >> >> >> >> >> >> <zhai7...@gmail.com> wrote: >>>>> >> >> >> >> >> >> > >>>>> >> >> >> >> >> >> > I did a quick run with your patch, but since I turned >>>>> >> >> >> >> >> >> > on the CMS as well as TieredMergePolicy I'm not sure >>>>> >> >> >> >> >> >> > how fair the comparison is. Here's the result: >>>>> >> >> >> >> >> >> > Candidate: >>>>> >> >> >> >> >> >> > Indexer: indexing done (890209 msec); total 33332620 >>>>> >> >> >> >> >> >> > docs >>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (71622 msec) >>>>> >> >> >> >> >> >> > Indexer: finished (961877 msec) >>>>> >> >> >> >> >> >> > Baseline: >>>>> >> >> >> >> >> >> > Indexer: indexing done (909706 msec); total 33332620 >>>>> >> >> >> >> >> >> > docs >>>>> >> >> >> >> >> >> > Indexer: waitForMerges done (54775 msec) >>>>> >> >> >> >> >> >> > Indexer: finished (964528 msec) >>>>> >> >> >> >> >> >> > >>>>> >> >> >> >> >> >> > For more accurate comparison I guess it's better to >>>>> >> >> >> >> >> >> > use LogxxMergePolicy and turn off CMS? If you want to >>>>> >> >> >> >> >> >> > run it yourself you can find the lines I quoted from >>>>> >> >> >> >> >> >> > the log file. >>>>> >> >> >> >> >> >> > >>>>> >> >> >> >> >> >> > Patrick >>>>> >> >> >> >> >> >> > >>>>> >> >> >> >> >> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien >>>>> >> >> >> >> >> >> > <thomas.dull...@elastic.co.invalid> wrote: >>>>> >> >> >> >> >> >> >> >>>>> >> >> >> >> >> >> >> Hey all, >>>>> >> >> >> >> >> >> >> >>>>> >> >> >> >> >> >> >> I've been experimenting with fixing some low-hanging >>>>> >> >> >> >> >> >> >> performance fruit in the ElasticSearch codebase, and >>>>> >> >> >> >> >> >> >> came across the fact that the MurmurHash >>>>> >> >> >> >> >> >> >> implementation that is used by ByteRef.hashCode() is >>>>> >> >> >> >> >> >> >> reading 4 bytes per loop iteration (which is likely >>>>> >> >> >> >> >> >> >> an artifact from 32-bit architectures, which are >>>>> >> >> >> >> >> >> >> ever-less-important). I made a small fix to change >>>>> >> >> >> >> >> >> >> the implementation to read 8 bytes per loop >>>>> >> >> >> >> >> >> >> iteration; I expected a very small impact (2-3% CPU >>>>> >> >> >> >> >> >> >> or so over an indexing run in ElasticSearch), but >>>>> >> >> >> >> >> >> >> got a pretty nontrivial throughput improvement over >>>>> >> >> >> >> >> >> >> a few indexing benchmarks. >>>>> >> >> >> >> >> >> >> >>>>> >> >> >> >> >> >> >> I tried running Lucene-only benchmarks, and >>>>> >> >> >> >> >> >> >> succeeded in running the example from >>>>> >> >> >> >> >> >> >> https://github.com/mikemccand/luceneutil - but I >>>>> >> >> >> >> >> >> >> couldn't figure out how to run indexing benchmarks >>>>> >> >> >> >> >> >> >> and how to interpret the results. >>>>> >> >> >> >> >> >> >> >>>>> >> >> >> >> >> >> >> Could someone help me in running the benchmarks for >>>>> >> >> >> >> >> >> >> the attached patch? >>>>> >> >> >> >> >> >> >> >>>>> >> >> >> >> >> >> >> Cheers, >>>>> >> >> >> >> >> >> >> Thomas >>>>> >> >> >> >> >> >> >> >>>>> >> >> >> >> >> >> >> --------------------------------------------------------------------- >>>>> >> >> >> >> >> >> >> To unsubscribe, e-mail: >>>>> >> >> >> >> >> >> >> dev-unsubscr...@lucene.apache.org >>>>> >> >> >> >> >> >> >> For additional commands, e-mail: >>>>> >> >> >> >> >> >> >> dev-h...@lucene.apache.org >>>>> >> >> >> >> >> >> >>>>> >> >> >> >> >> >> --------------------------------------------------------------------- >>>>> >> >> >> >> >> >> To unsubscribe, e-mail: >>>>> >> >> >> >> >> >> dev-unsubscr...@lucene.apache.org >>>>> >> >> >> >> >> >> For additional commands, e-mail: >>>>> >> >> >> >> >> >> dev-h...@lucene.apache.org >>>>> >> >> >> >> >> >> >>>>> >> >> >> >> >> >>>>> >> >> >> >> >> --------------------------------------------------------------------- >>>>> >> >> >> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>>>> >> >> >> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org >>>>> >> >> >> >> >> >>> >>> <murmur-tests.patch> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: dev-h...@lucene.apache.org >>> >>>
--------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org