Re: Multiple merge-runs from same set of segments

Patrick Zhai Thu, 27 May 2021 10:14:08 -0700

Sorry for the delayed response. As for caching termDict data across
threads, I am not aware of any existing Lucene mechanism that could do that (and
it might be tricky since it is across threads), but it may be worth trying to
see whether we can get some extra speed from that!
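
To make the idea concrete, here is a minimal sketch of what a cross-thread cache of decoded term-dictionary blocks could look like. Everything in it is an assumption for illustration: DecodedBlockCache, the file-pointer key, and the decoder callback are hypothetical names and not part of Lucene, and how it would be wired into a codec is left open.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongFunction;

/** Hypothetical cross-thread cache of decoded blocks; not a Lucene class. */
final class DecodedBlockCache {
    // Key: file pointer of a block (an assumed identifier). Value: its decoded bytes.
    private final ConcurrentHashMap<Long, byte[]> blocks = new ConcurrentHashMap<>();
    private final LongFunction<byte[]> decoder;
    private final AtomicInteger decodeCount = new AtomicInteger();

    DecodedBlockCache(LongFunction<byte[]> decoder) {
        this.decoder = decoder;
    }

    /** Returns the decoded block for the given file pointer, decoding it at most once. */
    byte[] get(long blockFilePointer) {
        // computeIfAbsent runs the mapping function atomically, at most once per key,
        // so concurrent readers of the same block share a single decoded copy.
        return blocks.computeIfAbsent(blockFilePointer, fp -> {
            decodeCount.incrementAndGet();
            return decoder.apply(fp);
        });
    }

    /** Number of times the decoder actually ran (useful for measuring hit rate). */
    int decodeCount() {
        return decodeCount.get();
    }
}
```

The map here is unbounded and callers must treat the returned arrays as read-only, so a real attempt would probably need an eviction policy and a measurement of whether it beats the OS page cache at all.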


Patrick

Ravikumar Govindarajan <[email protected]> wrote on Mon, May 24, 2021 at
11:49 AM:

> Thanks Patrick for the help!
>
> > May I know what lucene version you're using?
> >
>
> We are using an older version of lucene as of now (4.7.x) and I believe the
> FilterCodecReader of current version is akin to FilterAtomicReader & should
> do the job for us!
>
> > If it is not available, I'm not sure whether the merge will happen via merge
> > policy, maybe you could check the source code and see?
> >
>
> Checked & AFAIK, our old version doesn't support it. But I guess it should
> be fine to wrap a SortingAtomicReader and pass it to the API. I think it can
> be done!
>
> > But I think the current default directory implementation is MMapDirectory,
> > which delegates the caching to the system and should have
> > already optimized this situation
> >
>
> We do use the default MMap-dir but I was actually thinking about
> unpacking/walking Term-Dict data (FST) repeatedly from various
> threads, even if via MMap. Are there optimizations here (caching unpacked
> blocks etc..) that we could tap into?
>
> --
> Ravi
>
> On Mon, May 24, 2021 at 11:09 PM Patrick Zhai <[email protected]> wrote:
>
> > Hi Ravi,
> >
> > 1. May I know what Lucene version you're using? As far as I know,
> > SortingMergePolicy has been deprecated and replaced by
> > IndexWriterConfig.setIndexSort in newer Lucene versions. So if
> > "setIndexSort" is available, I would suggest using that to achieve the
> > sorted index (as you might have already figured out, the IndexRearranger
> > lets you pass in an IndexWriterConfig so that you could set it there). If it
> > is not available, I'm not sure whether the merge will happen via merge
> > policy, maybe you could check the source code and see?
> > 2. Yeah it's a good observation, we're doing multiple passes over one
> > segment! But I think the current default directory implementation is
> > MMapDirectory, which delegates the caching to the operating system and
> > should have already optimized this situation. Here's a great blog post
> > explaining MMapDirectory in Lucene:
> > https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > Best
> > Patrick
> >
> > Ravikumar Govindarajan <[email protected]> wrote on Mon, May 24, 2021 at
> > 9:54 AM:
> >
> > > Thanks Michael!
> > >
> > > This was just what I was looking for! Just a couple of questions.
> > >
> > >
> > >    - When we call addIndexes(IndexReader...), does the merge happen via
> > >    MergePolicy? We use a SortingMergePolicy and would like to maintain the
> > >    sort-order in newly created segments too
> > >    - Concurrency is a cool trick here. But if I understand the patch
> > >    correctly, don't we end up doing multiple passes over the Term Dict, one
> > >    for each Selector? Loading it fully in memory could help here, possibly?
> > >
> > > --
> > > Ravi
> > >
> > > On Mon, May 24, 2021 at 7:37 PM Michael McCandless <
> > > [email protected]> wrote:
> > >
> > > > Are you trying to rewrite your already created index into a different
> > > > segment geometry?
> > > >
> > > > Maybe have a look at the new IndexRearranger tool
> > > > <https://issues.apache.org/jira/browse/LUCENE-9694>?  It is already doing
> > > > something like what you enumerated below, including mocking LiveDocs to get
> > > > the right documents into the right segments.
> > > >
> > > > Mike McCandless
> > > >
> > > > http://blog.mikemccandless.com
> > > >
> > > >
> > > > On Sat, May 22, 2021 at 3:50 PM Ravikumar Govindarajan <
> > > > [email protected]> wrote:
> > > >
> > > >> Hello,
> > > >>
> > > > >> We have a use-case for index-rewrite on a "frozen index" where no new
> > > > >> documents are added. It goes like this..
> > > > >>
> > > > >>    1. Get all segments for the index (base-segment-list)
> > > > >>    2. Create a new segment from base-segment-list with unique set of docs
> > > > >>    (LiveDocs)
> > > > >>    3. Repeat step 2, for a fixed count. Like say 5 or 10 times
> > > > >>
> > > > >> Is something like this achievable via Merge Policy? We can disable commits
> > > > >> too, till the full run is completed.
> > > >>
> > > >> Any help is appreciated
> > > >>
> > > >> Regards,
> > > >> Ravi
> > > >>
> > > >
> > >
> >
>
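
For reference, the "unique set of docs (LiveDocs)" step from the original question can be sketched in plain Java as below. This only builds disjoint doc-ID masks, one per target segment. It does not touch any Lucene API, and LiveDocsPartition with its round-robin assignment is a hypothetical choice for illustration, not how IndexRearranger selects documents.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical helper, not part of Lucene: builds disjoint "live docs" masks. */
final class LiveDocsPartition {
    /**
     * Returns numSegments masks over doc IDs [0, maxDoc).
     * Each doc ID is set in exactly one mask, so every pass over the same
     * base segments would see a unique set of documents.
     */
    static List<BitSet> partition(int maxDoc, int numSegments) {
        List<BitSet> masks = new ArrayList<>();
        for (int i = 0; i < numSegments; i++) {
            masks.add(new BitSet(maxDoc));
        }
        for (int docId = 0; docId < maxDoc; docId++) {
            // Round-robin assignment is arbitrary; any rule that picks exactly
            // one target per document keeps the masks disjoint.
            masks.get(docId % numSegments).set(docId);
        }
        return masks;
    }
}
```

Each mask would then play the role of the mocked LiveDocs for one merge run over the same base segments.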
