Re: Patch to change murmurhash implementation slightly

Robert Muir Tue, 25 Apr 2023 06:20:10 -0700

if a word is of length 5, processing 8 bytes at a time isn't going to
speed anything up. there aren't 8 bytes to process.


On Tue, Apr 25, 2023 at 9:17 AM Thomas Dullien
<[email protected]> wrote:
>
> Is average word length <= 4 realistic though? I mean, even the english wiki 
> corpus has ~5, which would require two calls to the lucene layer instead of 
> one; e.g. multiple layers of virtual dispatch that are unnecessary?
>
> You're not going to pay any cycles for reading 8 bytes instead of 4 bytes, so 
> the cost of doing so will be the same - while speeding up in cases where 4 
> isn't quite enough?
>
> Cheers,
> Thomas
>
> On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <[email protected]> wrote:
>>
>> i think from my perspective it has nothing to do with cpus being
>> 32-bit or 64-bit and more to do with the average length of terms in
>> most languages being smaller than 8. for the languages with longer
>> word length, its usually because of complex morphology that most users
>> would stem away. so doing 4 bytes at a time seems optimal IMO.
>> languages from nature don't care about your cpu.
>>
>> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
>> <[email protected]> wrote:
>> >
>> > For a truly "pure" indexing test I usually use a single thread for 
>> > indexing, and SerialMergeScheduler (using that single thread to also do 
>> > single-threaded merging).  It makes the indexing take forever lol but it 
>> > produces "comparable" results.
>> >
>> > But ... this sounds like a great change anyway?  Do we really need to gate 
>> > it on benchmark results?  Do we think there could be a downside e.g. 
>> > slower indexing on (the dwindling) 32 bit CPUs?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <[email protected]> wrote:
>> >>
>> >> I think the results of the benchmark will depend on the properties of
>> >> the indexed terms. For english wikipedia (luceneutil) the average word
>> >> length is around 5 bytes so this optimization may not do much.
>> >>
>> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <[email protected]> wrote:
>> >> >
>> >> > I did a quick run with your patch, but since I turned on the CMS as 
>> >> > well as TieredMergePolicy I'm not sure how fair the comparison is. 
>> >> > Here's the result:
>> >> > Candidate:
>> >> > Indexer: indexing done (890209 msec); total 33332620 docs
>> >> > Indexer: waitForMerges done (71622 msec)
>> >> > Indexer: finished (961877 msec)
>> >> > Baseline:
>> >> > Indexer: indexing done (909706 msec); total 33332620 docs
>> >> > Indexer: waitForMerges done (54775 msec)
>> >> > Indexer: finished (964528 msec)
>> >> >
>> >> > For more accurate comparison I guess it's better to use 
>> >> > LogxxMergePolicy and turn off CMS? If you want to run it yourself you 
>> >> > can find the lines I quoted from the log file.
>> >> >
>> >> > Patrick
>> >> >
>> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien 
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> Hey all,
>> >> >>
>> >> >> I've been experimenting with fixing some low-hanging performance fruit 
>> >> >> in the ElasticSearch codebase, and came across the fact that the 
>> >> >> MurmurHash implementation that is used by ByteRef.hashCode() is 
>> >> >> reading 4 bytes per loop iteration (which is likely an artifact from 
>> >> >> 32-bit architectures, which are ever-less-important). I made a small 
>> >> >> fix to change the implementation to read 8 bytes per loop iteration; I 
>> >> >> expected a very small impact (2-3% CPU or so over an indexing run in 
>> >> >> ElasticSearch), but got a pretty nontrivial throughput improvement 
>> >> >> over a few indexing benchmarks.
>> >> >>
>> >> >> I tried running Lucene-only benchmarks, and succeeded in running the 
>> >> >> example from https://github.com/mikemccand/luceneutil - but I couldn't 
>> >> >> figure out how to run indexing benchmarks and how to interpret the 
>> >> >> results.
>> >> >>
>> >> >> Could someone help me in running the benchmarks for the attached patch?
>> >> >>
>> >> >> Cheers,
>> >> >> Thomas
>> >> >>
>> >> >> ---------------------------------------------------------------------
>> >> >> To unsubscribe, e-mail: [email protected]
>> >> >> For additional commands, e-mail: [email protected]
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: [email protected]
>> >> For additional commands, e-mail: [email protected]
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Patch to change murmurhash implementation slightly

Reply via email to