Re: Multiple merge-runs from same set of segments

Ravikumar Govindarajan Mon, 24 May 2021 11:49:16 -0700

Thanks Patrick for the help!

> May I know what lucene version you're using?
>


We are currently on an older version of Lucene (4.7.x). I believe
FilterAtomicReader there is the counterpart of FilterCodecReader in current
versions, so it should do the job for us!

> If it is not available, I'm not sure whether the merge will happen via merge
> policy, maybe you could check the source code and see?
>

I checked and, AFAIK, our old version doesn't support it. But I think it
should be fine to wrap each reader in a SortingAtomicReader and pass that to
addIndexes(IndexReader...). I guess it can be done!

> But I think the current default directory implementation is MMapDirectory,
> which delegates the caching to the system and should have
> already optimized this situation
>

We do use the default MMapDirectory, but I was actually thinking about the
cost of repeatedly unpacking/walking the term dictionary data (FST) from
various threads, even if it is read via mmap. Are there optimizations here
(caching unpacked blocks, etc.) that we could tap into?

--
Ravi

On Mon, May 24, 2021 at 11:09 PM Patrick Zhai <[email protected]> wrote:

> Hi Ravi,
>
> 1. May I know what Lucene version you're using? As far as I know,
> SortingMergePolicy has been deprecated and replaced by
> IndexWriterConfig.setIndexSort in newer Lucene versions. So if
> "setIndexSort" is available, I would suggest using that to achieve the
> sorted index (as you might have already figured out, the IndexRearranger
> lets you pass in an IndexWriterConfig, so you could set it there). If it
> is not available, I'm not sure whether the merge will happen via the merge
> policy; maybe you could check the source code and see?
> 2. Yeah, it's a good observation: we're doing multiple passes over one
> segment! But I think the current default directory implementation is
> MMapDirectory, which delegates the caching to the system and should have
> already optimized this situation. Here's a great blog post explaining
> MMapDirectory in Lucene:
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Best
> Patrick
>
> Ravikumar Govindarajan <[email protected]> wrote on Mon, May 24, 2021 at
> 9:54 AM:
>
> > Thanks Michael!
> >
> > This was just what I was looking for! Just a couple of questions.
> >
> >
> >    - When we call addIndexes(IndexReader...), does the merge happen via
> >    MergePolicy? We use a SortingMergePolicy and would like to maintain the
> >    sort-order in newly created segments too
> >    - Concurrency is a cool trick here. But if I understand the patch
> >    correctly, don't we end up doing multiple passes over the Term Dict, one
> >    for each Selector? Loading it fully in memory could help here, possibly?
> >
> > --
> > Ravi
> >
> > On Mon, May 24, 2021 at 7:37 PM Michael McCandless <
> > [email protected]> wrote:
> >
> > > Are you trying to rewrite your already created index into a different
> > > segment geometry?
> > >
> > > Maybe have a look at the new IndexRearranger tool
> > > <https://issues.apache.org/jira/browse/LUCENE-9694>?  It is already doing
> > > something like what you enumerated below, including mocking LiveDocs to get
> > > the right documents into the right segments.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Sat, May 22, 2021 at 3:50 PM Ravikumar Govindarajan <
> > > [email protected]> wrote:
> > >
> > >> Hello,
> > >>
> > >> We have a use-case for index-rewrite on a "frozen index" where no new
> > >> documents are added. It goes like this..
> > >>
> > >>    1. Get all segments for the index (base-segment-list)
> > >>    2. Create a new segment from base-segment-list with unique set of docs
> > >>    (LiveDocs)
> > >>    3. Repeat step 2, for a fixed count. Like say 5 or 10 times
> > >>
> > >> Is something like this achievable via Merge Policy? We can disable commits
> > >> too, till the full run is completed.
> > >>
> > >> Any help is appreciated
> > >>
> > >> Regards,
> > >> Ravi
> > >>
> > >
> >
>
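
The rewrite discussed in this thread (take one base segment list, then build
several new segments, each from its own unique set of live documents) can be
illustrated with a small sketch. This is plain Java with no Lucene dependency,
and it is not how IndexRearranger is implemented. The class name
LiveDocsPartition, both method names, and the "doc % segmentCount" selection
rule are hypothetical and chosen only for illustration. It shows the
bookkeeping only: one live-docs mask per output segment, with every document
live in exactly one mask.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LiveDocsPartition {

    /**
     * Builds one live-docs mask per output segment. Each document ID in
     * [0, maxDoc) is marked live in exactly one mask. The rule used here
     * (doc % segmentCount) is only a stand-in for a real selection rule.
     */
    public static List<BitSet> buildMasks(int maxDoc, int segmentCount) {
        List<BitSet> masks = new ArrayList<>();
        for (int i = 0; i < segmentCount; i++) {
            masks.add(new BitSet(maxDoc));
        }
        for (int doc = 0; doc < maxDoc; doc++) {
            masks.get(doc % segmentCount).set(doc);
        }
        return masks;
    }

    /** Returns true if every document ID in [0, maxDoc) is live in exactly one mask. */
    public static boolean isPartition(List<BitSet> masks, int maxDoc) {
        BitSet seen = new BitSet(maxDoc);
        for (BitSet mask : masks) {
            if (mask.intersects(seen)) {
                return false; // some document would land in two output segments
            }
            seen.or(mask);
        }
        // All IDs below maxDoc are covered and no ID at or beyond maxDoc is set.
        return seen.cardinality() == maxDoc && seen.length() <= maxDoc;
    }

    public static void main(String[] args) {
        int maxDoc = 10;
        List<BitSet> masks = buildMasks(maxDoc, 3);
        for (int i = 0; i < masks.size(); i++) {
            System.out.println("segment " + i + " live docs: " + masks.get(i));
        }
        System.out.println("valid partition: " + isPartition(masks, maxDoc));
    }
}
```

In a real run, each mask would correspond to one pass over the same base
segments, which is why the number of passes over the term dictionary grows
with the number of output segments.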
