Thanks for the comments… My only > concern about using the FixedBitSet is that it would make sorting each > postings list run in O(maxDoc) but maybe we can make it better by > using SparseFixedBitSet
Yes I was also thinking about this. But we are on 4.x and did not take the plunge. But as you said, it should be a good idea to test on SFBS I'm curious if you already performed any kind of benchmarking of this > approach? Yes we did a stress test of sorts aimed at SortingMergePolicy. We made most of our data as RAM resident and then CPU hot-spots came up... There were few take-aways from the test. I am listing down few of them.. It's kind of lengthy. Please read through... a) Postings-List issue, as discussed above… b) CompressingStoredFieldsReader did not store the last decoded 32KB chunk. Our segments are already sorted before participating in a merge. On mostly linear merge, we ended up decoding the same chunk again and again. Simply storing the last chunk resulted in good speed-ups for us... c) Once above steps were corrected, the CPU hotspot shifted to BlockDocsEnum. Here most of our postings-list < 128 docs. So Lucene41Postings started using vInts… I did try ForUtil encoding even for < 128 docs. It definitely went easy on CPU. But failed to measure resulting file-size increase. I realised not just SMP but any other merge must face the same issue and left it at that.. -- Ravi On Mon, Apr 20, 2015 at 12:56 PM, Adrien Grand <jpou...@gmail.com> wrote: > I like these ideas, the int[] we are using today are wasteful. My only > concern about using the FixedBitSet is that it would make sorting each > postings list run in O(maxDoc) but maybe we can make it better by > using SparseFixedBitSet (added in 5.0, given your code snippets I > assume you are still on 4.x)? > > I'm curious if you already performed any kind of benchmarking of this > approach? > > > On Tue, Apr 14, 2015 at 2:07 PM, Ravikumar Govindarajan > <ravikumar.govindara...@gmail.com> wrote: > > We were experimenting with SortingMergePolicy and came across an > alternate > > solution to TimSort of postings-list using FBS & GrowableWriter. > > > > I have attached relevant code-snippet. It would be nice if someone can > > clarify whether it is a good idea to implement... > > > > public class SortingAtomicReader { > > … > > … > > class SortingDocsEnum { > > > > //Last 2 variables namely *newdoclist* & *olddocToFreq* are added in > > //constructor. It is assumed that these 2 variables are init during > > //merge start & they are then re-used till merge completes... > > > > > > public SortingDocsEnum(int maxDoc, final DocsEnum in, boolean withFreqs, > > final Sorter.DocMap docMap, FixedBitSet newdoclist, GrowableWriter > > olddocToFreq) throws IOException { > > > > …. > > > > … > > > > while (true) { > > > > //Instead of Tim-Sorting as in existing code > > > > doc = in.nextDoc(); > > > > int newdoc = docMap.oldToNew(doc); > > > > newdoclist.set(newdoc); > > > > if(withFreqs) { > > > > olddocToFreq.set(doc, in.freq()); > > > > } > > > > } > > > > > > @Override > > > > public int nextDoc() throws IOException { > > > > if (++docIt >= upto) { > > > > return NO_MORE_DOCS; > > > > } > > > > currDoc = newdoclist.nextSetBit(++currDoc); > > > > if(currDoc == -1) { > > > > return NO_MORE_DOCS; > > > > } > > > > //clear the set-bit here before returning... > > > > newdoclist.clear(currDoc); > > > > return currDoc; > > > > } > > > > > > @Override > > > > public int freq() throws IOException { > > > > if(withFreqs && docIt < upto) { > > > > return (int)olddocToFreq.getMutable() > > > > .get(docMap.newToOld(currDoc)); > > > > } > > > > return 1; > > > > } > > > > } > > > > -- > Adrien > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >