Is average word length <= 4 realistic though? I mean, even the english wiki
corpus has ~5, which would require two calls to the lucene layer instead of
one; e.g. multiple layers of virtual dispatch that are unnecessary?

You're not going to pay any cycles for reading 8 bytes instead of 4 bytes,
so the cost of doing so will be the same - while speeding up in cases where
4 isn't quite enough?

Cheers,
Thomas

On Tue, Apr 25, 2023 at 3:07 PM Robert Muir <rcm...@gmail.com> wrote:

> i think from my perspective it has nothing to do with cpus being
> 32-bit or 64-bit and more to do with the average length of terms in
> most languages being smaller than 8. for the languages with longer
> word length, its usually because of complex morphology that most users
> would stem away. so doing 4 bytes at a time seems optimal IMO.
> languages from nature don't care about your cpu.
>
> On Tue, Apr 25, 2023 at 8:52 AM Michael McCandless
> <luc...@mikemccandless.com> wrote:
> >
> > For a truly "pure" indexing test I usually use a single thread for
> indexing, and SerialMergeScheduler (using that single thread to also do
> single-threaded merging).  It makes the indexing take forever lol but it
> produces "comparable" results.
> >
> > But ... this sounds like a great change anyway?  Do we really need to
> gate it on benchmark results?  Do we think there could be a downside e.g.
> slower indexing on (the dwindling) 32 bit CPUs?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Tue, Apr 25, 2023 at 7:39 AM Robert Muir <rcm...@gmail.com> wrote:
> >>
> >> I think the results of the benchmark will depend on the properties of
> >> the indexed terms. For english wikipedia (luceneutil) the average word
> >> length is around 5 bytes so this optimization may not do much.
> >>
> >> On Tue, Apr 25, 2023 at 1:58 AM Patrick Zhai <zhai7...@gmail.com>
> wrote:
> >> >
> >> > I did a quick run with your patch, but since I turned on the CMS as
> well as TieredMergePolicy I'm not sure how fair the comparison is. Here's
> the result:
> >> > Candidate:
> >> > Indexer: indexing done (890209 msec); total 33332620 docs
> >> > Indexer: waitForMerges done (71622 msec)
> >> > Indexer: finished (961877 msec)
> >> > Baseline:
> >> > Indexer: indexing done (909706 msec); total 33332620 docs
> >> > Indexer: waitForMerges done (54775 msec)
> >> > Indexer: finished (964528 msec)
> >> >
> >> > For more accurate comparison I guess it's better to use
> LogxxMergePolicy and turn off CMS? If you want to run it yourself you can
> find the lines I quoted from the log file.
> >> >
> >> > Patrick
> >> >
> >> > On Mon, Apr 24, 2023 at 12:34 PM Thomas Dullien <
> thomas.dull...@elastic.co.invalid> wrote:
> >> >>
> >> >> Hey all,
> >> >>
> >> >> I've been experimenting with fixing some low-hanging performance
> fruit in the ElasticSearch codebase, and came across the fact that the
> MurmurHash implementation that is used by ByteRef.hashCode() is reading 4
> bytes per loop iteration (which is likely an artifact from 32-bit
> architectures, which are ever-less-important). I made a small fix to change
> the implementation to read 8 bytes per loop iteration; I expected a very
> small impact (2-3% CPU or so over an indexing run in ElasticSearch), but
> got a pretty nontrivial throughput improvement over a few indexing
> benchmarks.
> >> >>
> >> >> I tried running Lucene-only benchmarks, and succeeded in running the
> example from https://github.com/mikemccand/luceneutil - but I couldn't
> figure out how to run indexing benchmarks and how to interpret the results.
> >> >>
> >> >> Could someone help me in running the benchmarks for the attached
> patch?
> >> >>
> >> >> Cheers,
> >> >> Thomas
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: dev-h...@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

Reply via email to