Re: Multiple merge-runs from same set of segments

Ravikumar Govindarajan Sun, 30 May 2021 04:07:12 -0700

Yeah, like you said, it looks quite tough to load the entire FST in memory.
Plus, we would have to address concurrent access.


I guess I have to first see how the current patch goes before any
optimisations are done

Thanks for the help!

—
Ravi
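
A minimal sketch of the concurrent-access concern mentioned above, assuming decoded term-dictionary blocks could be keyed by a file offset. `SharedBlockCache`, its `decoder` function, and the offset key are hypothetical names for illustration only; nothing here is a Lucene API, and the cache is unbounded, so a real one would need an eviction policy.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongFunction;

/** Hypothetical cache of decoded blocks shared by several merge threads. Not a Lucene class. */
final class SharedBlockCache<V> {
    private final ConcurrentHashMap<Long, V> blocks = new ConcurrentHashMap<>();
    private final LongFunction<V> decoder;              // assumed: decodes the block at an offset
    private final AtomicInteger decodeCount = new AtomicInteger();

    SharedBlockCache(LongFunction<V> decoder) {
        this.decoder = decoder;
    }

    /** Returns the decoded block, decoding it at most once even under concurrent calls. */
    V get(long offset) {
        return blocks.computeIfAbsent(offset, key -> {
            decodeCount.incrementAndGet();
            return decoder.apply(key);
        });
    }

    /** Number of times the decoder actually ran. */
    int decodeCount() {
        return decodeCount.get();
    }

    public static void main(String[] args) throws InterruptedException {
        SharedBlockCache<String> cache =
            new SharedBlockCache<>(offset -> "decoded-block@" + offset);
        // Two threads walk the same three offsets, as two selectors might.
        Runnable reader = () -> {
            for (long offset = 0; offset < 3; offset++) {
                cache.get(offset);
            }
        };
        Thread t1 = new Thread(reader);
        Thread t2 = new Thread(reader);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("decodes performed: " + cache.decodeCount());
    }
}
```

`ConcurrentHashMap.computeIfAbsent` applies the mapping function at most once per key, so repeated passes from different threads would share one decoded copy. Whether that beats re-reading through MMap would have to be measured.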

On Thu, 27 May 2021 at 10:44 PM, Patrick Zhai <zhai7...@gmail.com> wrote:

> Sorry for the delayed response. As for caching termDict data across
> threads, I am not aware of any existing Lucene mechanism that could do that
> (and it might be tricky since it is across threads), but it may be worth
> trying to see whether we can get some extra speed from it!
>
> Patrick
>
> Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote on Mon, 24 May 2021 at 11:49 AM:
>
> > Thanks Patrick for the help!
> >
> > > May I know what lucene version you're using?
> >
> > We are using an older version of Lucene as of now (4.7.x), and I believe the
> > FilterCodecReader of the current version is akin to FilterAtomicReader and
> > should do the job for us!
> >
> > > If it is not available, I'm not sure whether the merge will happen via
> > > merge policy, maybe you could check the source code and see?
> >
> > I checked and, AFAIK, our old version doesn't support it. But I guess it
> > should be fine to wrap a SortingAtomicReader and pass it to the API. I guess
> > it can be done!
> >
> > > But I think the current default directory implementation is MMapDirectory,
> > > which delegates the caching to the system and should have
> > > already optimized this situation
> >
> > We do use the default MMap-dir, but I was actually thinking about
> > unpacking/walking Term-Dict data (FST) repeatedly from various
> > threads, even if via MMap. Are there optimizations here (caching unpacked
> > blocks etc.) that we could tap into?
> >
> > --
> > Ravi
> >
> > On Mon, May 24, 2021 at 11:09 PM Patrick Zhai <zhai7...@gmail.com> wrote:
> >
> > > Hi Ravi,
> > >
> > > 1. May I know what Lucene version you're using? As far as I know,
> > > SortingMergePolicy has been deprecated and replaced by
> > > IndexWriterConfig.setIndexSort in newer Lucene versions. So if
> > > "setIndexSort" is available, I would suggest using that to achieve the
> > > sorted index (as you might have already figured out, the IndexRearranger
> > > lets you pass in an IndexWriterConfig so that you could set it there). If it
> > > is not available, I'm not sure whether the merge will happen via merge
> > > policy; maybe you could check the source code and see?
> > > 2. Yeah, it's a good observation, we're doing multiple passes over one
> > > segment! But I think the current default directory implementation is
> > > MMapDirectory, which delegates the caching to the system and should have
> > > already optimized this situation. Here's a great blog post explaining
> > > MMapDirectory in Lucene:
> > > https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> > >
> > > Best
> > > Patrick
> > >
> > > Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote on Mon, 24 May 2021 at 9:54 AM:
> > >
> > > > Thanks Michael!
> > > >
> > > > This was just what I was looking for! Just a couple of questions.
> > > >
> > > >    - When we call addIndexes(IndexReader...), does the merge happen via
> > > >    MergePolicy? We use a SortingMergePolicy and would like to maintain the
> > > >    sort-order in newly created segments too
> > > >    - Concurrency is a cool trick here. But if I understand the patch
> > > >    correctly, don't we end up doing multiple passes over the Term Dict,
> > > >    one for each Selector? Loading it fully in memory could help here,
> > > >    possibly?
> > > >
> > > > --
> > > > Ravi
> > > >
> > > > On Mon, May 24, 2021 at 7:37 PM Michael McCandless <luc...@mikemccandless.com> wrote:
> > > >
> > > > > Are you trying to rewrite your already created index into a different
> > > > > segment geometry?
> > > > >
> > > > > Maybe have a look at the new IndexRearranger tool
> > > > > <https://issues.apache.org/jira/browse/LUCENE-9694>?  It is already doing
> > > > > something like what you enumerated below, including mocking LiveDocs to get
> > > > > the right documents into the right segments.
> > > > >
> > > > > Mike McCandless
> > > > >
> > > > > http://blog.mikemccandless.com
> > > > >
> > > > > On Sat, May 22, 2021 at 3:50 PM Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote:
> > > > >
> > > > > >> Hello,
> > > > > >>
> > > > > >> We have a use-case for index-rewrite on a "frozen index" where no new
> > > > > >> documents are added. It goes like this:
> > > > > >>
> > > > > >>    1. Get all segments for the index (base-segment-list)
> > > > > >>    2. Create a new segment from base-segment-list with unique set of docs
> > > > > >>    (LiveDocs)
> > > > > >>    3. Repeat step 2, for a fixed count. Like say 5 or 10 times
> > > > > >>
> > > > > >> Is something like this achievable via Merge Policy? We can disable commits
> > > > > >> too, till the full run is completed.
> > > > > >>
> > > > > >> Any help is appreciated
> > > > > >>
> > > > > >> Regards,
> > > > > >> Ravi
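
The three-step rewrite described in the original question (one new segment per pass, each with its own unique set of live docs over the same base segments) can be sketched with plain bitsets. This is only an illustration under assumptions: `LiveDocsPartitioner` and the `selector` function are hypothetical names, no Lucene classes are used, and in a real run each bitset would have to be exposed as the live docs of a wrapping reader before being handed to `addIndexes`.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntUnaryOperator;

/** Hypothetical helper: splits doc IDs 0..maxDoc-1 into disjoint "live docs" sets. */
final class LiveDocsPartitioner {

    /**
     * Builds one bitset per target segment. The selector maps a doc ID to the
     * index of the target segment that should keep it, in [0, targets).
     */
    static List<BitSet> partition(int maxDoc, int targets, IntUnaryOperator selector) {
        List<BitSet> liveDocs = new ArrayList<>();
        for (int i = 0; i < targets; i++) {
            liveDocs.add(new BitSet(maxDoc));
        }
        for (int doc = 0; doc < maxDoc; doc++) {
            int target = selector.applyAsInt(doc);
            if (target < 0 || target >= targets) {
                throw new IllegalArgumentException("selector returned " + target + " for doc " + doc);
            }
            // Each doc is marked live in exactly one target, so the sets are disjoint.
            liveDocs.get(target).set(doc);
        }
        return liveDocs;
    }

    public static void main(String[] args) {
        // Ten docs spread round-robin over three target segments.
        List<BitSet> parts = partition(10, 3, doc -> doc % 3);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("target " + i + " keeps docs " + parts.get(i));
        }
    }
}
```

Because every doc is assigned to exactly one target, the passes can run independently (and concurrently) over the same base segments without duplicating or dropping documents.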
