Re: SortingMergePolicy for already sorted segments

Ravikumar Govindarajan Tue, 17 Jun 2014 23:05:27 -0700

>
> These lookups are expensive and will be done millions of times (each term,
> each DV field, each .. everything).



Yes, I think you have described the issue correctly. There is no way we can
achieve speed-ups without a DocMap, especially for repeated lookups/merge

 IndexWriter relies on this in order to properly apply incoming document
> deletions and field updates while the segments were merging


The need of a global DocMap is much more obvious in this case, which I had
missed.

 Perhaps it would be worth to explore a SortingMultiSortedAtomicReader
> which merge-sorts the postings and other data that way


If I understand your points correctly, a global DocMap is unavoidable for
obvious reasons but  a SlowCompositeReader can be substituted with a
SortingMultiSortedAtomicReader
which "may" result in benefits for my case.

I was actually trying to compare lucene's merges with NoSQL databases
[SSTables] that have the exact problem. But turns out lucene is a bit
special case

Thanks for taking the time and explaining. I shall try the approach you've
suggested


On Tue, Jun 17, 2014 at 10:50 PM, Shai Erera <ser...@gmail.com> wrote:

> That said... if we generate the global DocMap up front, there's no reason
> to not execute the merge of the segments more efficiently, i.e. without
> wrapping them in a SlowCompositeReaderWrapper.
>
> But that's not work for SortingMergePolicy, it's either a special
> SortingAtomicReader which wraps a group of readers + a global DocMap, and
> then merge-sorts them more efficiently than how it's done now. Or we tap
> into SegmentMerger .. which is way more complicated.
>
> Perhaps it would be worth to explore a SortingMultiSortedAtomicReader which
> merge-sorts the postings and other data that way ... I look at e.g how
> doc-values are merged .. not sure it will improve performance. But if you
> want to cons up a patch, that'd be awesome!
>
> Shai
>
>
> On Tue, Jun 17, 2014 at 8:01 PM, Shai Erera <ser...@gmail.com> wrote:
>
> > OK I think I now understand what you're asking :). It's unrelated though
> > to SortingMergePolicy. You propose to do the "merge" part of a
> merge-sort,
> > since we know the indexes are already sorted, right?
> >
> > This is something we've considered in the past, but it is very tricky
> (see
> > below) and we went with the SortingAR for simplicity and speed of coding.
> > If however you have an idea how we can easily implement that, that would
> be
> > awesome.
> >
> > So let's consider merging the posting lists of f:val from the N readers.
> > Say that each returns docs 0-3, and the merged posting will have 4*N
> > entries (say we don't have deletes). To properly merge them, you need to
> > lookup the sort-value of each document from each reader, and compare
> > according to it.
> >
> > Now you move on to f:val2 (another posting) and it wants to merge 100
> > other docs. So you need to lookup the value of each document, compare by
> > it, and merge them. And the process continues ...
> >
> > These lookups are expensive and will be done millions of times (each
> term,
> > each DV field, each .. everything).
> >
> > More than that, there's a serious issue of correctness, because you never
> > make a global sorting decision. So if f:val sees only a single document -
> > 0, in all segments, you want to map them to 4 GLOBALLY SORTED documents.
> If
> > you make a local decision based on these 4 documents, you will end up w/
> a
> > completely messed up segment.
> >
> > I think the global DocMap is really required. Forget about that that
> other
> > code, e.g. IndexWriter relies on this in order to properly apply incoming
> > document deletions and field updates while the segments were merging.
> It's
> > just a matter of correctness - we need to know the global sorted segment
> > map.
> >
> > Shai
> >
> >
> > On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan <
> > ravikumar.govindara...@gmail.com> wrote:
> >
> >> >
> >> > Therefore the DocMap is initialized only when the
> >> > merge actually executes ... what is there more to postpone?
> >>
> >>
> >> Agreed. However, what I am asking is, if there is an alternative to
> >> DocMap,
> >> will that be better? Plz read-on
> >>
> >>  And besides, if the segments are already sorted, you should return a
> >> null DocMap,
> >> > like Lucene code does ...
> >>
> >>
> >> What I am trying to say is, my individual segments are sorted. However,
> >> when a merge combines "N" individual sorted-segments, there needs to be
> a
> >> global sort-order for writing the new segment. Passing null DocMap won't
> >> work here, no?
> >>
> >> DocMap is one-way of bringing the global order during a merge. Another
> way
> >> is to use something like a MergedIterator<SegmentReader> instead of
> >> DocMap,
> >> which doesn't need any memory
> >>
> >> I was trying to get a heads-up on these 2 approaches. Please do let me
> >> know
> >> if I have understood correctly
> >>
> >> --
> >> Ravi
> >>
> >>
> >>
> >>
> >> On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera <ser...@gmail.com> wrote:
> >>
> >> > >
> >> > > I am afraid the DocMap still maintains doc-id mappings till merge
> and
> >> I
> >> > am
> >> > > trying to avoid it...
> >> > >
> >> >
> >> > What do you mean 'till merge'? The method OneMerge.getMergeReaders()
> is
> >> > called only when the merge is executed, not when the MergePolicy
> >> decided to
> >> > merge those segments. Therefore the DocMap is initialized only when
> the
> >> > merge actually executes ... what is there more to postpone?
> >> >
> >> > And besides, if the segments are already sorted, you should return a
> >> null
> >> > DocMap, like Lucene code does ...
> >> >
> >> > If I miss your point, I'd appreciate if you can point me to a code
> >> example,
> >> > preferably in Lucene source, which demonstrates the problem.
> >> >
> >> > Shai
> >> >
> >> >
> >> > On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan <
> >> > ravikumar.govindara...@gmail.com> wrote:
> >> >
> >> > > I am afraid the DocMap still maintains doc-id mappings till merge
> and
> >> I
> >> > am
> >> > > trying to avoid it...
> >> > >
> >> > > I think lucene itself has a MergeIterator in o.a.l.util package.
> >> > >
> >> > > A MergePolicy can wrap a simple MergeIterator for iterating docs
> >> across
> >> > > different AtomicReaders in correct sort-order for a given field/term
> >> > >
> >> > > That should be fine right?
> >> > >
> >> > > --
> >> > > Ravi
> >> > >
> >> > > --
> >> > > Ravi
> >> > >
> >> > >
> >> > > On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera <ser...@gmail.com>
> wrote:
> >> > >
> >> > > > loadSortTerm is your method right? In the current Sorter.sort
> >> > > > implementation, I see this code:
> >> > > >
> >> > > >     boolean sorted = true;
> >> > > >     for (int i = 1; i < maxDoc; ++i) {
> >> > > >       if (comparator.compare(i-1, i) > 0) {
> >> > > >         sorted = false;
> >> > > >         break;
> >> > > >       }
> >> > > >     }
> >> > > >     if (sorted) {
> >> > > >       return null;
> >> > > >     }
> >> > > >
> >> > > > Perhaps you can write similar code?
> >> > > >
> >> > > > Also note that the sorting interface has changed, I think in 4.8,
> >> and
> >> > now
> >> > > > you don't really need to implement a Sorter, but rather pass a
> >> > SortField,
> >> > > > if that works for you.
> >> > > >
> >> > > > Shai
> >> > > >
> >> > > >
> >> > > > On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan <
> >> > > > ravikumar.govindara...@gmail.com> wrote:
> >> > > >
> >> > > > > Shai,
> >> > > > >
> >> > > > > This is the code snippet I use inside my class...
> >> > > > >
> >> > > > > public class MySorter extends Sorter {
> >> > > > >
> >> > > > > @Override
> >> > > > >
> >> > > > > public DocMap sort(AtomicReader reader) throws IOException {
> >> > > > >
> >> > > > >   final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);
> >> > > > >
> >> > > > >   final Sorter.DocComparator comparator = new
> >> Sorter.DocComparator()
> >> > {
> >> > > > >
> >> > > > >   @Override
> >> > > > >
> >> > > > >    public int compare(int docID1, int docID2) {
> >> > > > >
> >> > > > >       BytesRef v1 = docVsId.get(docID1);
> >> > > > >
> >> > > > >       BytesRef v2 = docVsId.get(docID2);
> >> > > > >
> >> > > > >        return v1.compareTo(v2);
> >> > > > >
> >> > > > >    }
> >> > > > >
> >> > > > >  };
> >> > > > >
> >> > > > >  return sort(reader.maxDoc(), comparator);
> >> > > > >
> >> > > > > }
> >> > > > > }
> >> > > > >
> >> > > > > My Problem is, the "AtomicReader" passed to Sorter.sort method
> is
> >> > > > actually
> >> > > > > a SlowCompositeReader, composed of a list of AtomicReaders each
> of
> >> > > which
> >> > > > is
> >> > > > > already sorted.
> >> > > > >
> >> > > > > I find this "loadSortTerm(compositeReader)" to be a bit heavy
> >> where
> >> > it
> >> > > > > tries to all load the doc-to-term mappings eagerly...
> >> > > > >
> >> > > > > Are there some alternatives for this?
> >> > > > >
> >> > > > > --
> >> > > > > Ravi
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera <ser...@gmail.com>
> >> > wrote:
> >> > > > >
> >> > > > > > I'm not sure that I follow ... where do you see DocMap being
> >> loaded
> >> > > up
> >> > > > > > front? Specifically, Sorter.sort may return null of the
> readers
> >> are
> >> > > > > already
> >> > > > > > sorted ... I think we already optimized for the case where the
> >> > > readers
> >> > > > > are
> >> > > > > > sorted.
> >> > > > > >
> >> > > > > > Shai
> >> > > > > >
> >> > > > > >
> >> > > > > > On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan <
> >> > > > > > ravikumar.govindara...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > I am planning to use SortingMergePolicy where all the
> >> > > > > merge-participating
> >> > > > > > > segments are already sorted... I understand that I need to
> >> > define a
> >> > > > > > DocMap
> >> > > > > > > with old-new doc-id mappings.
> >> > > > > > >
> >> > > > > > > Is it possible to optimize the eager loading of DocMap and
> >> make
> >> > it
> >> > > > kind
> >> > > > > > of
> >> > > > > > > lazy load on-demand?
> >> > > > > > >
> >> > > > > > > Ex: Pass List<AtomicReader> to the caller and ask for next
> >> > new-old
> >> > > > doc
> >> > > > > > > mapping..
> >> > > > > > >
> >> > > > > > > Since my segments are already sorted, I could save on
> memory a
> >> > > > > little-bit
> >> > > > > > > this way, instead of loading the full DocMap upfront
> >> > > > > > >
> >> > > > > > > --
> >> > > > > > > Ravi
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: SortingMergePolicy for already sorted segments

Reply via email to