hi Michael, lucene 4 has so much changes that I don't know how to index and search with specified codec. could you please give me some code snipplets that using PFor codec so I can trace the codes. in you blog http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html you said "The AND query, curiously, got faster; I think this is because I modified its scoring to first try seeking within the block of decoded ints". I am also curious about the result because VINT only need decode part of the doclist while PFor need decode the whole block. But I think with conjuction queries, the main time is used for searching in skiplist. I haven't read your codes yet. But I guess the skiplist for VINT and the skiplist for PFor is different. e.g. lucene 2.9's default skipInterval is 16, so it like level 1 256 level 0 16 32 48 64 80 96 112 128 ... 256 when need skipTo(60) we need read 0 16 32 48 64 in level0 but when use block, e.g. block size is 128, my implementation of skiplist is level 1 256 level 0 128 256 when skipTo(60) we only read 2 item in level0 and decode the first block which contains 128 docIDs
How do you implement bulk read? I did like this: I decode a block and cache it in a int array. I think I can buffer up to 100K docIDs and tfs for disjuction queries(it cost less than 1MB memory for each term) SegmentTermDocs.read(final int[] docs, final int[] freqs) ........... while (i < length && count < df) { if (curBlockIdx >= curBlockSize) { //this condition is often false, we may optimize it. but JVM hotspots will cache "hot" codes. So ... int idBlockBytes = freqStream.readVInt(); curBlockIdx = 0; for (int k = 0; k < idBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockIds = code.decode(buffer,idBlockBytes); curBlockSize = blockIds.length; int tfBlockBytes = freqStream.readVInt(); for (int k = 0; k < tfBlockBytes; k++) { buffer[k] = freqStream.readInt(); } blockTfs = code.decode(buffer, tfBlockBytes); assert curBlockSize == decoded.length; } freq = blockTfs[curBlockIdx]; doc += blockIds[curBlockIdx++]; count++; if (deletedDocs == null || !deletedDocs.get(doc)) { docs[i] = doc; freqs[i] = freq; ++i; } } 2010/12/15 Michael McCandless <luc...@mikemccandless.com>: > Hi Li Li, > > That issue has such a big patch, and enough of us are now iterating on > it, that we cut a dedicated branch for it. > > But note that this branch is off of trunk (to be 4.0). > > You should be able to do this: > > svn checkout > https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings > > And then run things in there. I just committed FOR/PFOR prototype > codecs from LUCENE-1410 onto that branch, so eg you can run unit tests > using those codecs by running "ant test > -Dtests.codec=PatchedFrameOfRef". > > Please post patches back if you improve things! We need all the help > we can get :) > > Mike > > On Wed, Dec 15, 2010 at 5:54 AM, Li Li <fancye...@gmail.com> wrote: >> hi Michael >> you posted a patch here https://issues.apache.org/jira/browse/LUCENE-2723 >> I am not familiar with patch. do I need download >> LUCENE-2723.patch(there are many patches after this name, do I need >> the latest one?) and LUCENE-2723_termscorer.patch and patch them >> (patch -p1 <LUCENE-2723.patch)? I just check out the latest source >> code from http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene >> >> >> 2010/12/14 Michael McCandless <luc...@mikemccandless.com>: >>> Likely you are seeing the startup cost of hotspot compiling the PFOR code? >>> >>> Ie, does your test first "warmup" the JRE and then do the real test? >>> >>> I've also found that running -Xbatch produces more consistent results >>> from run to run, however, those results may not be as fast as running >>> w/o -Xbatch. >>> >>> Also, it's better to test on actual data (ie a Lucene index's >>> postings), and in the full context of searching, because then we get a >>> sense of what speedups a real app will see... micro-benching is nearly >>> impossible in Java since Hotspot acts very differently vs the "real" >>> test. >>> >>> Mike >>> >>> On Tue, Dec 14, 2010 at 2:50 AM, Li Li <fancye...@gmail.com> wrote: >>>> Hi >>>> I tried to integrate PForDelta into lucene 2.9 but confronted a problem. >>>> I use the implementation in >>>> http://code.google.com/p/integer-array-compress-kit/ >>>> it implements a basic PForDelta algorithm and an improved one(which >>>> called NewPForDelta, but there are many bugs and I have fixed them), >>>> But compare it with VInt and S9, it's speed is very slow when only >>>> decode small number of integer arrays. >>>> e.g. when I decoded int[256] arrays which values are randomly >>>> generated between 0 and 100, if decode just one array. PFor(or >>>> NewPFor) is very slow. when it continuously decodes many arrays such >>>> as 10000, it's faster than s9 and vint. >>>> Another strange phenomena is that when call PFor decoder twice, the >>>> 2nd times it's faster. Or I call PFor first then NewPFor, the NewPFor >>>> is faster. reverse the call sequcence, the 2nd called decoder is >>>> faster >>>> e.g. >>>> ct.testNewPFDCodes(list); >>>> ct.testPFor(list); >>>> ct.testVInt(list); >>>> ct.testS9(list); >>>> >>>> NewPFD decode: 3614705 >>>> PForDelta decode: 17320 >>>> VINT decode: 16483 >>>> S9 decode: 19835 >>>> when I call by the following sequence >>>> >>>> ct.testPFor(list); >>>> ct.testNewPFDCodes(list); >>>> ct.testVInt(list); >>>> ct.testS9(list); >>>> >>>> PForDelta decode: 3212140 >>>> NewPFD decode: 19556 >>>> VINT decode: 16762 >>>> S9 decode: 16483 >>>> >>>> My implementation is -- group docIDs and termDocFreqs into block >>>> which contains 128 integers. when SegmentTermDocs's next method >>>> called(or read readNoTf).it decodes a block and save it to a cache. >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: dev-h...@lucene.apache.org >>>> >>>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: dev-h...@lucene.apache.org >>> >>> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org >> For additional commands, e-mail: dev-h...@lucene.apache.org >> >> > > --------------------------------------------------------------------- > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: dev-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org