So ~ 555 flushes?
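
A quick back-of-envelope check of that number. This is only a sketch: it assumes the maxBufferedDocs=49774 quoted below divides the total doc count exactly, so the total used here is reconstructed from the quoted figures, not measured. FlushMath and flushCount are made-up names and no Lucene classes are involved.

```java
// Sketch only: plain arithmetic, no Lucene classes involved.
final class FlushMath {

    // Flushes triggered purely by maxBufferedDocs (ceiling division,
    // so a partially filled last buffer still counts as one flush).
    static long flushCount(long totalDocs, int maxBufferedDocs) {
        return (totalDocs + maxBufferedDocs - 1) / maxBufferedDocs;
    }

    public static void main(String[] args) {
        int maxBufferedDocs = 49774;              // value from the quoted mail
        long totalDocs = 555L * maxBufferedDocs;  // assumption: exact multiple
        System.out.println(totalDocs);                               // 27624570
        System.out.println(flushCount(totalDocs, maxBufferedDocs));  // 555
    }
}
```

So if flushing is driven only by maxBufferedDocs, roughly 27.6M docs at 49774 docs per flush would come out at about 555 flushes.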

I see over 3k samples from Adrien's link in
org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#<init>()

I still think the issue is that tokenstreams from analyzers get reused
"better" than the ones from StringField, because they have a threadlocal?
Whereas StringField relies upon the reuse of IndexingChain.PerField.

Maybe the reuse can be handled better inside IndexWriter, so that it isn't
lost on flush? Just don't cross the tokenstreams. It would be bad :)
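
To make the per-thread vs. per-thread-per-segment difference concrete, here is a toy model. It is a sketch of my reading of this thread, not a statement about how IndexWriter actually behaves: ReuseModel and allocations are hypothetical names, a plain Object stands in for a reusable tokenstream, and the 100 threads / 200 flushes are the example figures from the quoted mail below.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: Object stands in for a reusable token stream.
final class ReuseModel {

    // Counts allocations when every thread indexes docsPerThread docs per segment.
    // dropCacheOnFlush=false models a per-thread cache that survives flushes;
    // dropCacheOnFlush=true models a cache that belongs to the segment being built.
    static int allocations(int threads, int flushes, int docsPerThread,
                           boolean dropCacheOnFlush) {
        int[] created = {0};
        Map<Integer, Object> cache = new HashMap<>();
        for (int segment = 0; segment < flushes; segment++) {
            if (dropCacheOnFlush) {
                cache = new HashMap<>(); // new segment, nothing left to reuse
            }
            for (int thread = 0; thread < threads; thread++) {
                for (int doc = 0; doc < docsPerThread; doc++) {
                    // Allocate only if this thread has no cached instance yet.
                    cache.computeIfAbsent(thread, t -> {
                        created[0]++;
                        return new Object();
                    });
                }
            }
        }
        return created[0];
    }

    public static void main(String[] args) {
        System.out.println(allocations(100, 200, 10, false)); // 100
        System.out.println(allocations(100, 200, 10, true));  // 20000
    }
}
```

In this model the number of docs per segment does not matter at all; only the number of times the cache is thrown away does, which is why frequent flushes would hurt the second case.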

On Thu, Oct 21, 2021 at 9:03 AM Michael McCandless
<[email protected]> wrote:
>
> Ahh we are indeed doing that.  The maxBufferedDocs is total-doc-count / 555, 
> to provoke precisely a "5 big segments + 5 medium segments + 5 baby segments" 
> consistent segment geometry in the end.
>
> But that works out to:
>
>     maxBufferedDocs=49774
>
> Which is not too tiny?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Oct 21, 2021 at 8:52 AM Robert Muir <[email protected]> wrote:
>>
>> Yeah, I'm pretty lost in all the ways we index here. But if we are
>> passing maxBufferedDocs <low number> for this deterministic indexing,
>> I think it would cause the issue? I have no idea what the IW config
>> here is...
>>
>> On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <[email protected]> wrote:
>> >
>> > On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <[email protected]> wrote:
>> > >
>> > > But also the internal reuse of IndexingChain.PerField (which houses
>> > > the reused tokenstream) isn't just per-thread, it is
>> > > per-thread-per-segment, right? So if Mike is indexing with 100
>> > > threads, and flushes 200 times, I'd expect 20k of these things to be
>> > > made. There's a lot going on in the benchmark code for nightly and it
>> > > is tricky for me to try to navigate the various cases (1KB,
>> > > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
>> >
>> > I think this might be the case with your link. If you look at the URL
>> > of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu?
>> > This makes me think I'm looking at the profiler output of the
>> > "deterministic indexing".
>> > For this one, LogDocMergePolicy is used.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
