> > These lookups are expensive and will be done millions of times (each term, > each DV field, each .. everything).
Yes, I think you have described the issue correctly. There is no way we can achieve speed-ups without a DocMap, especially for repeated lookups/merge IndexWriter relies on this in order to properly apply incoming document > deletions and field updates while the segments were merging The need of a global DocMap is much more obvious in this case, which I had missed. Perhaps it would be worth to explore a SortingMultiSortedAtomicReader > which merge-sorts the postings and other data that way If I understand your points correctly, a global DocMap is unavoidable for obvious reasons but a SlowCompositeReader can be substituted with a SortingMultiSortedAtomicReader which "may" result in benefits for my case. I was actually trying to compare lucene's merges with NoSQL databases [SSTables] that have the exact problem. But turns out lucene is a bit special case Thanks for taking the time and explaining. I shall try the approach you've suggested On Tue, Jun 17, 2014 at 10:50 PM, Shai Erera <ser...@gmail.com> wrote: > That said... if we generate the global DocMap up front, there's no reason > to not execute the merge of the segments more efficiently, i.e. without > wrapping them in a SlowCompositeReaderWrapper. > > But that's not work for SortingMergePolicy, it's either a special > SortingAtomicReader which wraps a group of readers + a global DocMap, and > then merge-sorts them more efficiently than how it's done now. Or we tap > into SegmentMerger .. which is way more complicated. > > Perhaps it would be worth to explore a SortingMultiSortedAtomicReader which > merge-sorts the postings and other data that way ... I look at e.g how > doc-values are merged .. not sure it will improve performance. But if you > want to cons up a patch, that'd be awesome! > > Shai > > > On Tue, Jun 17, 2014 at 8:01 PM, Shai Erera <ser...@gmail.com> wrote: > > > OK I think I now understand what you're asking :). It's unrelated though > > to SortingMergePolicy. You propose to do the "merge" part of a > merge-sort, > > since we know the indexes are already sorted, right? > > > > This is something we've considered in the past, but it is very tricky > (see > > below) and we went with the SortingAR for simplicity and speed of coding. > > If however you have an idea how we can easily implement that, that would > be > > awesome. > > > > So let's consider merging the posting lists of f:val from the N readers. > > Say that each returns docs 0-3, and the merged posting will have 4*N > > entries (say we don't have deletes). To properly merge them, you need to > > lookup the sort-value of each document from each reader, and compare > > according to it. > > > > Now you move on to f:val2 (another posting) and it wants to merge 100 > > other docs. So you need to lookup the value of each document, compare by > > it, and merge them. And the process continues ... > > > > These lookups are expensive and will be done millions of times (each > term, > > each DV field, each .. everything). > > > > More than that, there's a serious issue of correctness, because you never > > make a global sorting decision. So if f:val sees only a single document - > > 0, in all segments, you want to map them to 4 GLOBALLY SORTED documents. > If > > you make a local decision based on these 4 documents, you will end up w/ > a > > completely messed up segment. > > > > I think the global DocMap is really required. Forget about that that > other > > code, e.g. IndexWriter relies on this in order to properly apply incoming > > document deletions and field updates while the segments were merging. > It's > > just a matter of correctness - we need to know the global sorted segment > > map. > > > > Shai > > > > > > On Tue, Jun 17, 2014 at 3:41 PM, Ravikumar Govindarajan < > > ravikumar.govindara...@gmail.com> wrote: > > > >> > > >> > Therefore the DocMap is initialized only when the > >> > merge actually executes ... what is there more to postpone? > >> > >> > >> Agreed. However, what I am asking is, if there is an alternative to > >> DocMap, > >> will that be better? Plz read-on > >> > >> And besides, if the segments are already sorted, you should return a > >> null DocMap, > >> > like Lucene code does ... > >> > >> > >> What I am trying to say is, my individual segments are sorted. However, > >> when a merge combines "N" individual sorted-segments, there needs to be > a > >> global sort-order for writing the new segment. Passing null DocMap won't > >> work here, no? > >> > >> DocMap is one-way of bringing the global order during a merge. Another > way > >> is to use something like a MergedIterator<SegmentReader> instead of > >> DocMap, > >> which doesn't need any memory > >> > >> I was trying to get a heads-up on these 2 approaches. Please do let me > >> know > >> if I have understood correctly > >> > >> -- > >> Ravi > >> > >> > >> > >> > >> On Tue, Jun 17, 2014 at 5:42 PM, Shai Erera <ser...@gmail.com> wrote: > >> > >> > > > >> > > I am afraid the DocMap still maintains doc-id mappings till merge > and > >> I > >> > am > >> > > trying to avoid it... > >> > > > >> > > >> > What do you mean 'till merge'? The method OneMerge.getMergeReaders() > is > >> > called only when the merge is executed, not when the MergePolicy > >> decided to > >> > merge those segments. Therefore the DocMap is initialized only when > the > >> > merge actually executes ... what is there more to postpone? > >> > > >> > And besides, if the segments are already sorted, you should return a > >> null > >> > DocMap, like Lucene code does ... > >> > > >> > If I miss your point, I'd appreciate if you can point me to a code > >> example, > >> > preferably in Lucene source, which demonstrates the problem. > >> > > >> > Shai > >> > > >> > > >> > On Tue, Jun 17, 2014 at 3:03 PM, Ravikumar Govindarajan < > >> > ravikumar.govindara...@gmail.com> wrote: > >> > > >> > > I am afraid the DocMap still maintains doc-id mappings till merge > and > >> I > >> > am > >> > > trying to avoid it... > >> > > > >> > > I think lucene itself has a MergeIterator in o.a.l.util package. > >> > > > >> > > A MergePolicy can wrap a simple MergeIterator for iterating docs > >> across > >> > > different AtomicReaders in correct sort-order for a given field/term > >> > > > >> > > That should be fine right? > >> > > > >> > > -- > >> > > Ravi > >> > > > >> > > -- > >> > > Ravi > >> > > > >> > > > >> > > On Tue, Jun 17, 2014 at 1:24 PM, Shai Erera <ser...@gmail.com> > wrote: > >> > > > >> > > > loadSortTerm is your method right? In the current Sorter.sort > >> > > > implementation, I see this code: > >> > > > > >> > > > boolean sorted = true; > >> > > > for (int i = 1; i < maxDoc; ++i) { > >> > > > if (comparator.compare(i-1, i) > 0) { > >> > > > sorted = false; > >> > > > break; > >> > > > } > >> > > > } > >> > > > if (sorted) { > >> > > > return null; > >> > > > } > >> > > > > >> > > > Perhaps you can write similar code? > >> > > > > >> > > > Also note that the sorting interface has changed, I think in 4.8, > >> and > >> > now > >> > > > you don't really need to implement a Sorter, but rather pass a > >> > SortField, > >> > > > if that works for you. > >> > > > > >> > > > Shai > >> > > > > >> > > > > >> > > > On Tue, Jun 17, 2014 at 9:41 AM, Ravikumar Govindarajan < > >> > > > ravikumar.govindara...@gmail.com> wrote: > >> > > > > >> > > > > Shai, > >> > > > > > >> > > > > This is the code snippet I use inside my class... > >> > > > > > >> > > > > public class MySorter extends Sorter { > >> > > > > > >> > > > > @Override > >> > > > > > >> > > > > public DocMap sort(AtomicReader reader) throws IOException { > >> > > > > > >> > > > > final Map<Integer, BytesRef> docVsId = loadSortTerm(reader); > >> > > > > > >> > > > > final Sorter.DocComparator comparator = new > >> Sorter.DocComparator() > >> > { > >> > > > > > >> > > > > @Override > >> > > > > > >> > > > > public int compare(int docID1, int docID2) { > >> > > > > > >> > > > > BytesRef v1 = docVsId.get(docID1); > >> > > > > > >> > > > > BytesRef v2 = docVsId.get(docID2); > >> > > > > > >> > > > > return v1.compareTo(v2); > >> > > > > > >> > > > > } > >> > > > > > >> > > > > }; > >> > > > > > >> > > > > return sort(reader.maxDoc(), comparator); > >> > > > > > >> > > > > } > >> > > > > } > >> > > > > > >> > > > > My Problem is, the "AtomicReader" passed to Sorter.sort method > is > >> > > > actually > >> > > > > a SlowCompositeReader, composed of a list of AtomicReaders each > of > >> > > which > >> > > > is > >> > > > > already sorted. > >> > > > > > >> > > > > I find this "loadSortTerm(compositeReader)" to be a bit heavy > >> where > >> > it > >> > > > > tries to all load the doc-to-term mappings eagerly... > >> > > > > > >> > > > > Are there some alternatives for this? > >> > > > > > >> > > > > -- > >> > > > > Ravi > >> > > > > > >> > > > > > >> > > > > On Tue, Jun 17, 2014 at 10:58 AM, Shai Erera <ser...@gmail.com> > >> > wrote: > >> > > > > > >> > > > > > I'm not sure that I follow ... where do you see DocMap being > >> loaded > >> > > up > >> > > > > > front? Specifically, Sorter.sort may return null of the > readers > >> are > >> > > > > already > >> > > > > > sorted ... I think we already optimized for the case where the > >> > > readers > >> > > > > are > >> > > > > > sorted. > >> > > > > > > >> > > > > > Shai > >> > > > > > > >> > > > > > > >> > > > > > On Tue, Jun 17, 2014 at 4:04 AM, Ravikumar Govindarajan < > >> > > > > > ravikumar.govindara...@gmail.com> wrote: > >> > > > > > > >> > > > > > > I am planning to use SortingMergePolicy where all the > >> > > > > merge-participating > >> > > > > > > segments are already sorted... I understand that I need to > >> > define a > >> > > > > > DocMap > >> > > > > > > with old-new doc-id mappings. > >> > > > > > > > >> > > > > > > Is it possible to optimize the eager loading of DocMap and > >> make > >> > it > >> > > > kind > >> > > > > > of > >> > > > > > > lazy load on-demand? > >> > > > > > > > >> > > > > > > Ex: Pass List<AtomicReader> to the caller and ask for next > >> > new-old > >> > > > doc > >> > > > > > > mapping.. > >> > > > > > > > >> > > > > > > Since my segments are already sorted, I could save on > memory a > >> > > > > little-bit > >> > > > > > > this way, instead of loading the full DocMap upfront > >> > > > > > > > >> > > > > > > -- > >> > > > > > > Ravi > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > > > > >