Re: SortingAtomicReader alternate to Tim-Sort...

Ravikumar Govindarajan Tue, 21 Apr 2015 01:27:07 -0700

Thanks for the comments…

My only
> concern about using the FixedBitSet is that it would make sorting each
> postings list run in O(maxDoc) but maybe we can make it better by
> using SparseFixedBitSet

Yes I was also thinking about this. But we are on 4.x and did not take the
plunge. But as you said, it should be a good idea to test on SFBS

I'm curious if you already performed any kind of benchmarking of this
> approach?

Yes we did a stress test of sorts aimed at SortingMergePolicy. We made most
of our data as RAM resident and then CPU hot-spots came up...

There were few take-aways from the test. I am listing down few of them..
It's kind of lengthy. Please read through...

a) Postings-List issue, as discussed above…

b) CompressingStoredFieldsReader did not store the last decoded 32KB chunk.
Our segments are already sorted before participating in a merge. On mostly
linear merge, we ended up decoding the same chunk again and again. Simply
storing the last chunk resulted in good speed-ups for us...

c) Once above steps were corrected, the CPU hotspot shifted to
BlockDocsEnum. Here most of our postings-list < 128 docs. So
Lucene41Postings started using vInts…  I did try ForUtil encoding even for
< 128 docs. It definitely went easy on CPU. But failed to measure resulting
file-size increase.

I realised not just SMP but any other merge must face the same issue and
left it at that..

--
Ravi

On Mon, Apr 20, 2015 at 12:56 PM, Adrien Grand <[email protected]> wrote:

> I like these ideas, the int[] we are using today are wasteful. My only
> concern about using the FixedBitSet is that it would make sorting each
> postings list run in O(maxDoc) but maybe we can make it better by
> using SparseFixedBitSet (added in 5.0, given your code snippets I
> assume you are still on 4.x)?
>
> I'm curious if you already performed any kind of benchmarking of this
> approach?
>
>
> On Tue, Apr 14, 2015 at 2:07 PM, Ravikumar Govindarajan
> <[email protected]> wrote:
> > We were experimenting with SortingMergePolicy and came across an
> alternate
> > solution to TimSort of postings-list using FBS & GrowableWriter.
> >
> > I have attached relevant code-snippet. It would be nice if someone can
> > clarify whether it is a good idea to implement...
> >
> > public class SortingAtomicReader {
> > …
> > …
> > class SortingDocsEnum {
> >
> > //Last 2 variables namely *newdoclist* & *olddocToFreq* are added in
> > //constructor. It is assumed that these 2 variables are init during
> > //merge start & they are then re-used till merge completes...
> >
> >
> > public SortingDocsEnum(int maxDoc, final DocsEnum in, boolean withFreqs,
> > final Sorter.DocMap docMap, FixedBitSet newdoclist, GrowableWriter
> > olddocToFreq) throws IOException {
> >
> > ….
> >
> > …
> >
> > while (true) {
> >
> >   //Instead of Tim-Sorting as in existing code
> >
> >   doc = in.nextDoc();
> >
> >   int newdoc = docMap.oldToNew(doc);
> >
> >   newdoclist.set(newdoc);
> >
> >   if(withFreqs) {
> >
> >     olddocToFreq.set(doc, in.freq());
> >
> >   }
> >
> > }
> >
> >
> > @Override
> >
> > public int nextDoc() throws IOException {
> >
> >   if (++docIt >= upto) {
> >
> >   return NO_MORE_DOCS;
> >
> >   }
> >
> >   currDoc = newdoclist.nextSetBit(++currDoc);
> >
> >   if(currDoc == -1) {
> >
> >     return NO_MORE_DOCS;
> >
> >   }
> >
> >   //clear the set-bit here before returning...
> >
> >   newdoclist.clear(currDoc);
> >
> >   return currDoc;
> >
> > }
> >
> >
> > @Override
> >
> > public int freq() throws IOException {
> >
> >   if(withFreqs && docIt < upto) {
> >
> >   return (int)olddocToFreq.getMutable()
> >
> >                  .get(docMap.newToOld(currDoc));
> >
> >   }
> >
> >   return 1;
> >
> > }
> >
> > }
>
>
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

Re: SortingAtomicReader alternate to Tim-Sort...

Reply via email to