Yeah, like you said, it looks quite tough to load the entire FST in-mem.
Plus we have to address concurrent access.

I guess I have to first see how the current patch goes before any
optimisations are done.

Thanks for the help!

--
Ravi

On Thu, 27 May 2021 at 10:44 PM, Patrick Zhai <zhai7...@gmail.com> wrote:

> Sorry for the delayed response. As for caching termDict data across
> threads, I am not aware of any existing Lucene mechanism that could do
> that (and it might be tricky since it is across threads), but it may be
> worth trying to see whether we can get some extra speed based on that!
>
> Patrick
>
> Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote on
> Mon, May 24, 2021 at 11:49 AM:
>
> > Thanks Patrick for the help!
> >
> > > May I know what lucene version you're using?
> >
> > We are using an older version of Lucene as of now (4.7.x) and I
> > believe the FilterCodecReader of the current version is akin to
> > FilterAtomicReader & should do the job for us!
> >
> > > If it is not available, I'm not sure whether the merge will happen
> > > via merge policy, maybe you could check the source code and see?
> >
> > Checked & AFAIK, our old version doesn't support it. But I guess it
> > should be fine to wrap a SortingAtomicReader and pass it to the API.
> > Guess it can be done!
> >
> > > But I think the current default directory implementation is
> > > MMapDirectory, which delegate the caching to the system and should
> > > have already optimized this situation
> >
> > We do use the default MMap-dir, but I was actually thinking about
> > unpacking/walking Term-Dict data (FST) repeatedly from various
> > threads, even if via MMap. Are there optimizations here (caching
> > unpacked blocks etc.) that we could tap into?
> >
> > --
> > Ravi
> >
> > On Mon, May 24, 2021 at 11:09 PM Patrick Zhai <zhai7...@gmail.com>
> > wrote:
> >
> > > Hi Ravi,
> > >
> > > 1. May I know what Lucene version you're using? As far as I know
> > >    the SortingMergePolicy has been deprecated and replaced by
> > >    IndexWriterConfig.setIndexSort in newer Lucene versions. So if
> > >    "setIndexSort" is available I would suggest using that to
> > >    achieve the sorted index (as you might have already figured out,
> > >    the IndexRearranger lets you pass in an IndexWriterConfig so
> > >    that you could set it there). If it is not available, I'm not
> > >    sure whether the merge will happen via merge policy, maybe you
> > >    could check the source code and see?
> > > 2. Yeah, it's a good observation, we're doing multiple passes over
> > >    one segment! But I think the current default directory
> > >    implementation is MMapDirectory, which delegates the caching to
> > >    the system and should have already optimized this situation.
> > >    Here's a great blog explaining the MMapDirectory in Lucene:
> > >    https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> > >
> > > Best
> > > Patrick
> > >
> > > Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote on
> > > Mon, May 24, 2021 at 9:54 AM:
> > >
> > > > Thanks Michael!
> > > >
> > > > This was just what I was looking for! Just a couple of
> > > > questions.
> > > >
> > > >    - When we call addIndexes(IndexReader...), does the merge
> > > >      happen via MergePolicy? We use a SortingMergePolicy and
> > > >      would like to maintain the sort-order in newly created
> > > >      segments too.
> > > >    - Concurrency is a cool trick here. But if I understand the
> > > >      patch correctly, don't we end up doing multiple passes over
> > > >      the Term Dict, one for each Selector? Loading it fully in
> > > >      memory could help here, possibly?
> > > >
> > > > --
> > > > Ravi
> > > >
> > > > On Mon, May 24, 2021 at 7:37 PM Michael McCandless
> > > > <luc...@mikemccandless.com> wrote:
> > > >
> > > > > Are you trying to rewrite your already created index into a
> > > > > different segment geometry?
> > > > >
> > > > > Maybe have a look at the new IndexRearranger tool
> > > > > <https://issues.apache.org/jira/browse/LUCENE-9694>? It is
> > > > > already doing something like what you enumerated below,
> > > > > including mocking LiveDocs to get the right documents into
> > > > > the right segments.
> > > > >
> > > > > Mike McCandless
> > > > >
> > > > > http://blog.mikemccandless.com
> > > > >
> > > > > On Sat, May 22, 2021 at 3:50 PM Ravikumar Govindarajan
> > > > > <ravikumar.govindara...@gmail.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > We have a use-case for index-rewrite on a "frozen index"
> > > > > > where no new documents are added. It goes like this:
> > > > > >
> > > > > >    1. Get all segments for the index (base-segment-list)
> > > > > >    2. Create a new segment from base-segment-list with a
> > > > > >       unique set of docs (LiveDocs)
> > > > > >    3. Repeat step 2 for a fixed count, say 5 or 10 times
> > > > > >
> > > > > > Is something like this achievable via Merge Policy? We can
> > > > > > disable commits too, till the full run is completed.
> > > > > >
> > > > > > Any help is appreciated
> > > > > >
> > > > > > Regards,
> > > > > > Ravi
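
For readers following the thread, the three-step rewrite in the original question (one disjoint LiveDocs set per new segment) can be sketched without any Lucene dependency. This is only an illustration under stated assumptions: it models live docs as plain java.util.BitSet masks, and the class LiveDocsPartitionSketch, the method partition, and the doc-to-segment selector function are hypothetical names invented here. It is not the IndexRearranger code and does not use any Lucene API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Hypothetical helper, not part of Lucene: models "live docs" as plain BitSets.
class LiveDocsPartitionSketch {

    // Builds one mask per output segment. Bit d is set in mask s when the
    // (assumed) selector function assigns document d to segment s, so every
    // document ends up in exactly one mask.
    static List<BitSet> partition(int maxDoc, int segmentCount, IntUnaryOperator selector) {
        List<BitSet> masks = new ArrayList<>();
        for (int s = 0; s < segmentCount; s++) {
            masks.add(new BitSet(maxDoc));
        }
        for (int doc = 0; doc < maxDoc; doc++) {
            int target = selector.applyAsInt(doc);
            if (target < 0 || target >= segmentCount) {
                throw new IllegalArgumentException(
                        "selector returned " + target + " for doc " + doc);
            }
            masks.get(target).set(doc);
        }
        return masks;
    }

    public static void main(String[] args) {
        // Example: spread 10 documents round-robin over 3 target segments.
        List<BitSet> masks = partition(10, 3, doc -> doc % 3);
        for (int s = 0; s < masks.size(); s++) {
            System.out.println("segment " + s + " keeps docs " + masks.get(s));
        }
    }
}
```

In a real rewrite, each mask would play the role of the mocked LiveDocs for one pass over the base segments. How that mask is handed to a reader wrapper depends on the Lucene version in use and is left out here.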