Re: SortingMergePolicy for already sorted segments

Shai Erera Tue, 17 Jun 2014 10:02:58 -0700

OK I think I now understand what you're asking :). It's unrelated though to
SortingMergePolicy. You propose to do the "merge" part of a merge-sort,
since we know the indexes are already sorted, right?


This is something we've considered in the past, but it is very tricky (see
below) and we went with the SortingAR for simplicity and speed of coding.
If however you have an idea how we can easily implement that, that would be
awesome.

So let's consider merging the posting lists of f:val from the N readers.
Say that each returns docs 0-3, and the merged posting will have 4*N
entries (say we don't have deletes). To properly merge them, you need to
lookup the sort-value of each document from each reader, and compare
according to it.

Now you move on to f:val2 (another posting) and it wants to merge 100 other
docs. So you need to lookup the value of each document, compare by it, and
merge them. And the process continues ...

These lookups are expensive and will be done millions of times (each term,
each DV field, each .. everything).

More than that, there's a serious issue of correctness, because you never
make a global sorting decision. So if f:val sees only a single document -
0, in all segments, you want to map them to 4 GLOBALLY SORTED documents. If
you make a local decision based on these 4 documents, you will end up w/ a
completely messed up segment.

I think the global DocMap is really required. Forget about that that other
code, e.g. IndexWriter relies on this in order to properly apply incoming
document deletions and field updates while the segments were merging. It's
just a matter of correctness - we need to know the global sorted segment
map.

Shai


On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan <
ravikumar.govindara...@gmail.com> wrote:

> >
> > Therefore the DocMap is initialized only when the
> > merge actually executes ... what is there more to postpone?
>
>
> Agreed. However, what I am asking is, if there is an alternative to DocMap,
> will that be better? Plz read-on
>
>  And besides, if the segments are already sorted, you should return a
> null DocMap,
> > like Lucene code does ...
>
>
> What I am trying to say is, my individual segments are sorted. However,
> when a merge combines "N" individual sorted-segments, there needs to be a
> global sort-order for writing the new segment. Passing null DocMap won't
> work here, no?
>
> DocMap is one-way of bringing the global order during a merge. Another way
> is to use something like a MergedIterator<SegmentReader> instead of DocMap,
> which doesn't need any memory
>
> I was trying to get a heads-up on these 2 approaches. Please do let me know
> if I have understood correctly
>
> --
> Ravi
>
>
>
>
> On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera <ser...@gmail.com> wrote:
>
> > >
> > > I am afraid the DocMap still maintains doc-id mappings till merge and I
> > am
> > > trying to avoid it...
> > >
> >
> > What do you mean 'till merge'? The method OneMerge.getMergeReaders() is
> > called only when the merge is executed, not when the MergePolicy decided
> to
> > merge those segments. Therefore the DocMap is initialized only when the
> > merge actually executes ... what is there more to postpone?
> >
> > And besides, if the segments are already sorted, you should return a null
> > DocMap, like Lucene code does ...
> >
> > If I miss your point, I'd appreciate if you can point me to a code
> example,
> > preferably in Lucene source, which demonstrates the problem.
> >
> > Shai
> >
> >
> > On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan <
> > ravikumar.govindara...@gmail.com> wrote:
> >
> > > I am afraid the DocMap still maintains doc-id mappings till merge and I
> > am
> > > trying to avoid it...
> > >
> > > I think lucene itself has a MergeIterator in o.a.l.util package.
> > >
> > > A MergePolicy can wrap a simple MergeIterator for iterating docs across
> > > different AtomicReaders in correct sort-order for a given field/term
> > >
> > > That should be fine right?
> > >
> > > --
> > > Ravi
> > >
> > > --
> > > Ravi
> > >
> > >
> > > On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera <ser...@gmail.com> wrote:
> > >
> > > > loadSortTerm is your method right? In the current Sorter.sort
> > > > implementation, I see this code:
> > > >
> > > >     boolean sorted = true;
> > > >     for (int i = 1; i < maxDoc; ++i) {
> > > >       if (comparator.compare(i-1, i) > 0) {
> > > >         sorted = false;
> > > >         break;
> > > >       }
> > > >     }
> > > >     if (sorted) {
> > > >       return null;
> > > >     }
> > > >
> > > > Perhaps you can write similar code?
> > > >
> > > > Also note that the sorting interface has changed, I think in 4.8, and
> > now
> > > > you don't really need to implement a Sorter, but rather pass a
> > SortField,
> > > > if that works for you.
> > > >
> > > > Shai
> > > >
> > > >
> > > > On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan <
> > > > ravikumar.govindara...@gmail.com> wrote:
> > > >
> > > > > Shai,
> > > > >
> > > > > This is the code snippet I use inside my class...
> > > > >
> > > > > public class MySorter extends Sorter {
> > > > >
> > > > > @Override
> > > > >
> > > > > public DocMap sort(AtomicReader reader) throws IOException {
> > > > >
> > > > >   final Map<Integer, BytesRef> docVsId = loadSortTerm(reader);
> > > > >
> > > > >   final Sorter.DocComparator comparator = new
> Sorter.DocComparator()
> > {
> > > > >
> > > > >   @Override
> > > > >
> > > > >    public int compare(int docID1, int docID2) {
> > > > >
> > > > >       BytesRef v1 = docVsId.get(docID1);
> > > > >
> > > > >       BytesRef v2 = docVsId.get(docID2);
> > > > >
> > > > >        return v1.compareTo(v2);
> > > > >
> > > > >    }
> > > > >
> > > > >  };
> > > > >
> > > > >  return sort(reader.maxDoc(), comparator);
> > > > >
> > > > > }
> > > > > }
> > > > >
> > > > > My Problem is, the "AtomicReader" passed to Sorter.sort method is
> > > > actually
> > > > > a SlowCompositeReader, composed of a list of AtomicReaders each of
> > > which
> > > > is
> > > > > already sorted.
> > > > >
> > > > > I find this "loadSortTerm(compositeReader)" to be a bit heavy where
> > it
> > > > > tries to all load the doc-to-term mappings eagerly...
> > > > >
> > > > > Are there some alternatives for this?
> > > > >
> > > > > --
> > > > > Ravi
> > > > >
> > > > >
> > > > > On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera <ser...@gmail.com>
> > wrote:
> > > > >
> > > > > > I'm not sure that I follow ... where do you see DocMap being
> loaded
> > > up
> > > > > > front? Specifically, Sorter.sort may return null of the readers
> are
> > > > > already
> > > > > > sorted ... I think we already optimized for the case where the
> > > readers
> > > > > are
> > > > > > sorted.
> > > > > >
> > > > > > Shai
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan <
> > > > > > ravikumar.govindara...@gmail.com> wrote:
> > > > > >
> > > > > > > I am planning to use SortingMergePolicy where all the
> > > > > merge-participating
> > > > > > > segments are already sorted... I understand that I need to
> > define a
> > > > > > DocMap
> > > > > > > with old-new doc-id mappings.
> > > > > > >
> > > > > > > Is it possible to optimize the eager loading of DocMap and make
> > it
> > > > kind
> > > > > > of
> > > > > > > lazy load on-demand?
> > > > > > >
> > > > > > > Ex: Pass List<AtomicReader> to the caller and ask for next
> > new-old
> > > > doc
> > > > > > > mapping..
> > > > > > >
> > > > > > > Since my segments are already sorted, I could save on memory a
> > > > > little-bit
> > > > > > > this way, instead of loading the full DocMap upfront
> > > > > > >
> > > > > > > --
> > > > > > > Ravi
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: SortingMergePolicy for already sorted segments

Reply via email to