LOL don't cross the tokenstreams!

Yeah, it should be 555 or 556 flushes, I think.  Multiplied by the number of
indexed fields, that probably gets us to the 3K count?

+1 to improve IW's internal re-use in the non-analyzed StringField case.
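
For anyone following along, here is a rough sketch of the arithmetic in this thread. It is only a back-of-the-envelope model, not anything from the benchmark code: the class and method names are made up, the total doc count is simply back-computed as 49774 * 555, and the "one PerField per thread, per flushed segment, per indexed field" multiplier is the assumption being discussed below, not a measured fact.

```java
public class FlushEstimate {

    // Number of flushes if a segment is flushed every maxBufferedDocs documents
    // (ceiling division, so a trailing partial segment counts as one flush).
    static long flushCount(long totalDocs, long maxBufferedDocs) {
        return (totalDocs + maxBufferedDocs - 1) / maxBufferedDocs;
    }

    // Hypothetical upper bound on per-field objects, assuming one is created
    // per indexing thread, per flushed segment, per indexed field.
    static long perFieldInstances(long threads, long flushes, long indexedFields) {
        return threads * flushes * indexedFields;
    }

    public static void main(String[] args) {
        long maxBufferedDocs = 49774;           // value quoted in the thread
        long totalDocs = maxBufferedDocs * 555; // assumption: back-computed, not the real corpus size

        long flushes = flushCount(totalDocs, maxBufferedDocs);
        System.out.println("flushes = " + flushes);

        // Robert's example: 100 threads, 200 flushes, counted for a single field.
        System.out.println("100 threads x 200 flushes = " + perFieldInstances(100, 200, 1));

        // Single thread with an assumed 6 indexed fields lands in the ~3K range.
        System.out.println("1 thread x " + flushes + " flushes x 6 fields = "
                + perFieldInstances(1, flushes, 6));
    }
}
```

The 6-field figure is just a guess that makes the product land near 3K; the real number of indexed fields in the nightly benchmark would need to be checked.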

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 21, 2021 at 9:14 AM Robert Muir <[email protected]> wrote:

> So ~ 555 flushes?
>
> I see over 3k samples from Adrien's link in
>
> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#<init>()
> I still think the issue is, tokenstreams from analyzers reuse "better"
> than ones from StringField, because they have a threadlocal? Whereas
> the StringField relies upon the reuse of IndexingChain.PerField.
>
> Maybe it can be better inside IndexWriter, so that it isn't lost on
> flush? Just don't cross the tokenstreams. It would be bad :)
>
> On Thu, Oct 21, 2021 at 9:03 AM Michael McCandless
> <[email protected]> wrote:
> >
> > Ahh we are indeed doing that.  The maxBufferedDocs is total-doc-count /
> > 555, to provoke precisely a "5 big segments + 5 medium segments + 5 baby
> > segments" consistent segment geometry in the end.
> >
> > But that works out to:
> >
> >     maxBufferedDocs=49774
> >
> > Which is not too tiny?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Thu, Oct 21, 2021 at 8:52 AM Robert Muir <[email protected]> wrote:
> >>
> >> Yeah, I'm pretty lost in all the ways we index here. But if we are
> >> passing maxBufferedDocs <low number> for this deterministic indexing,
> >> I think it would cause the issue? I have no idea what the IW config
> >> here is...
> >>
> >> On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <[email protected]> wrote:
> >> >
> >> > On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <[email protected]> wrote:
> >> > >
> >> > > But also the internal reuse of IndexingChain.PerField (which houses
> >> > > the reused tokenstream) isn't just per-thread, it is
> >> > > per-thread-per-segment, right? So if Mike is indexing with 100
> >> > > threads, and flushes 200 times, I'd expect 20k of these things to be
> >> > > made. There's a lot going on in the benchmark code for nightly and it
> >> > > is tricky for me to try to navigate the various cases (1KB,
> >> > > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
> >> >
> >> > I think this might be the case with your link. If you look at the URL
> >> > of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu?
> >> > This makes me think I'm looking at the profiler output of the
> >> > "deterministic indexing".
> >> > For this one, LogDocMergePolicy is used.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [email protected]
> >> For additional commands, e-mail: [email protected]
> >>
>
