I think random testing is not sufficient: in normal situations some branches are never executed. I tested http://code.google.com/p/integer-array-compress-kit/ with many random int arrays and it worked, but when I used it in real indexing it produced corruption during the optimize stage. PForDelta chooses the best numFrameBits, and some bit widths, such as 31, are rarely produced by random arrays. So I "forced" the encoder to choose every possible numFrameBits, to exercise all of decode1 ... decode32, and found several bugs that way.

What is pfor2? Does it use s9/s16 to encode the exceptions and offsets? In http://code.google.com/p/integer-array-compress-kit/ it's s9. NewPForDelta also had many bugs, and it likewise needs every branch tested to ensure it works well.
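[Editor's note] The branch-coverage trick described above — forcing each frame bit width so every decodeN path runs — can be sketched as follows. This is not the kit's actual API; `blockForBits` is an illustrative helper that generates a block guaranteed to require exactly b frame bits:

```java
import java.util.Random;

public class BitWidthArrays {
    /** Build a block of ints that needs a frame width of exactly b bits (1 <= b <= 32). */
    static int[] blockForBits(int b, int size, long seed) {
        Random r = new Random(seed);
        long mask = (b == 32) ? 0xFFFFFFFFL : ((1L << b) - 1);
        int[] block = new int[size];
        for (int i = 0; i < size; i++) {
            block[i] = (int) (r.nextLong() & mask); // every value fits in b bits
        }
        // plant one value whose top bit is bit b-1, so exactly b bits are required
        block[r.nextInt(size)] = (int) (1L << (b - 1));
        return block;
    }
}
```

Feeding each of these 32 blocks through encode/decode round-trips would touch every decodeN branch, which purely random data almost never does.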
2010/12/20 Michael McCandless <luc...@mikemccandless.com>:
> It's autogen'd from the Python script in that dir (gendecompress),
> but we are actively experimenting with the numerous ways to feed it
> data from the file, to see what the JVM can most efficiently execute.
>
> For testing, we need better coverage here. But I have an initial
> "encode random ints" test that I'm about to commit to the bulkpostings
> branch... the pfor1 impl passes it, but pfor2 doesn't yet (I think
> maybe because Simple16 can't handle ints >= 2^28?).
>
> Mike
>
> On Sun, Dec 19, 2010 at 10:06 PM, Li Li <fancye...@gmail.com> wrote:
>> Is ForDecompressImpl generated by code or manually written?
>> I am frustrated by
>> http://code.google.com/p/integer-array-compress-kit/, which contains
>> too many bugs (I fixed more than 20, but bugs still remain),
>> because the decoder has too many branches, and in normal situations
>> some branches seldom occur.
>>
>> The decoders implemented on the branch assume blockSize == 128, so they
>> have fewer branches than a decoder supporting an arbitrary blockSize.
>> How do you test these decoders to ensure every branch is exercised?
>>
>> 2010/12/16 Michael McCandless <luc...@mikemccandless.com>:
>>> On the bulkpostings branch you can do something like this:
>>>
>>>   CodecProvider cp = new CodecProvider();
>>>   cp.register(new PatchedFrameOfRefCodec());
>>>   cp.setDefaultFieldCodec("PatchedFrameOfRef");
>>>
>>> Then whenever you create an IW or IR, use the advanced method that
>>> accepts a CodecProvider. Then the index will always use PForDelta to
>>> write/read.
>>>
>>> I suspect conjunction queries got faster because we no longer skip if
>>> the docID we seek is already in the current buffer (currently sized
>>> 64). Ie, skip is very costly when the target isn't far.
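[Editor's note] Mike's guess about pfor2 failing on large ints is plausible if pfor2 uses Simple16 for exceptions: a Simple16 word spends 4 of its 32 bits on the selector, leaving at most 28 data bits per word, so any single value needing 29 or more bits cannot be represented. A quick predicate (name and framing are mine, not the codec's API) capturing that limit:

```java
public class Simple16Limit {
    /** Simple16 packs a 4-bit selector plus at most 28 data bits into each
     *  32-bit word, so no single value wider than 28 bits is encodable. */
    static boolean encodable(int v) {
        return v >= 0 && v < (1 << 28);
    }
}
```

A random-int test that draws values across the full 32-bit range would trip this immediately, which matches the reported failure.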
This was
>>> sort of an accidental byproduct of forcing even conjunction queries
>>> using the Standard (vInt) codec to work on block buffers, but I think
>>> it's an important optimization that we should apply more generally.
>>>
>>> Skipping for block codecs and Standard/vInt is done with the same class
>>> now. It's just that a block codec must store the long filePointer
>>> where the block starts *and* the int offset into the block, whereas the
>>> Standard codec stores just a filePointer.
>>>
>>> On "how do we implement bulk read": this is the core change on the
>>> bulkpostings branch -- we have a new API to separately bulk-read
>>> docDeltas, freqs, and positionDeltas. But we are rapidly iterating on
>>> improving this (and getting to a clean PFor/For impl) now...
>>>
>>> Mike
>>>
>>> On Thu, Dec 16, 2010 at 4:29 AM, Li Li <fancye...@gmail.com> wrote:
>>>> hi Michael,
>>>> Lucene 4 has so many changes that I don't know how to index and
>>>> search with a specified codec. Could you please give me some code
>>>> snippets using the PFor codec so I can trace the code?
>>>> In your blog
>>>> http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html
>>>> you said "The AND query, curiously, got faster; I think this is
>>>> because I modified its scoring to first try seeking within the block
>>>> of decoded ints".
>>>> I am also curious about that result, because vInt only needs to decode
>>>> part of the doc list while PFor must decode the whole block. But I
>>>> think that with conjunction queries most of the time goes to searching
>>>> the skip list. I haven't read your code yet, but I guess the skip
>>>> lists for vInt and for PFor are different.
>>>> E.g. Lucene 2.9's default skipInterval is 16, so it looks like:
>>>>   level 1: 256
>>>>   level 0: 16 32 48 64 80 96 112 128 ... 256
>>>> For skipTo(60) we need to read 0 16 32 48 64 in level 0,
>>>> but when using blocks, e.g.
a block size of 128, my implementation of the skip list is:
>>>>   level 1: 256
>>>>   level 0: 128 256
>>>> For skipTo(60) we read only 2 items in level 0 and decode the first
>>>> block, which contains 128 docIDs.
>>>>
>>>> How do you implement bulk read?
>>>> I did it like this: I decode a block and cache it in an int array. I
>>>> think I can buffer up to 100K docIDs and tfs for disjunction queries
>>>> (it costs less than 1MB of memory per term).
>>>>
>>>> SegmentTermDocs.read(final int[] docs, final int[] freqs)
>>>>   ...
>>>>   while (i < length && count < df) {
>>>>     // often false; the JVM hotspot will cache "hot" code, so ...
>>>>     if (curBlockIdx >= curBlockSize) {
>>>>       int idBlockBytes = freqStream.readVInt();
>>>>       curBlockIdx = 0;
>>>>       for (int k = 0; k < idBlockBytes; k++) {
>>>>         buffer[k] = freqStream.readInt();
>>>>       }
>>>>       blockIds = code.decode(buffer, idBlockBytes);
>>>>       curBlockSize = blockIds.length;
>>>>
>>>>       int tfBlockBytes = freqStream.readVInt();
>>>>       for (int k = 0; k < tfBlockBytes; k++) {
>>>>         buffer[k] = freqStream.readInt();
>>>>       }
>>>>       blockTfs = code.decode(buffer, tfBlockBytes);
>>>>       assert curBlockSize == blockTfs.length;
>>>>     }
>>>>     freq = blockTfs[curBlockIdx];
>>>>     doc += blockIds[curBlockIdx++];
>>>>     count++;
>>>>
>>>>     if (deletedDocs == null || !deletedDocs.get(doc)) {
>>>>       docs[i] = doc;
>>>>       freqs[i] = freq;
>>>>       ++i;
>>>>     }
>>>>   }
>>>>
>>>> 2010/12/15 Michael McCandless <luc...@mikemccandless.com>:
>>>>> Hi Li Li,
>>>>>
>>>>> That issue has such a big patch, and enough of us are now iterating on
>>>>> it, that we cut a dedicated branch for it.
>>>>>
>>>>> But note that this branch is off of trunk (to be 4.0).
>>>>>
>>>>> You should be able to do this:
>>>>>
>>>>> svn checkout
>>>>> https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings
>>>>>
>>>>> And then run things in there.
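[Editor's note] The skip-list comparison above (skipInterval 16 vs. one entry per 128-doc block) can be made concrete with a toy model. The uniform-interval layout and the name `entriesScanned` are simplifications of mine, not Lucene's actual skip-list code:

```java
public class SkipSketch {
    /** Toy level-0 scan: entries sit at interval, 2*interval, 3*interval, ...
     *  Count how many entries are read before one passes the target docID. */
    static int entriesScanned(int target, int interval) {
        int scanned = 0;
        for (int entry = interval; ; entry += interval) {
            scanned++;
            if (entry > target) {
                return scanned;
            }
        }
    }
}
```

With interval 16, skipTo(60) touches 4 entries; with a 128-doc block, a single entry suffices, though the whole 128-int block must then be decoded. That trade-off is exactly the one debated in the thread.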
I just committed FOR/PFOR prototype
>>>>> codecs from LUCENE-1410 onto that branch, so e.g. you can run unit
>>>>> tests using those codecs by running "ant test
>>>>> -Dtests.codec=PatchedFrameOfRef".
>>>>>
>>>>> Please post patches back if you improve things! We need all the help
>>>>> we can get :)
>>>>>
>>>>> Mike
>>>>>
>>>>> On Wed, Dec 15, 2010 at 5:54 AM, Li Li <fancye...@gmail.com> wrote:
>>>>>> hi Michael,
>>>>>> You posted a patch here:
>>>>>> https://issues.apache.org/jira/browse/LUCENE-2723
>>>>>> I am not familiar with patches. Do I need to download
>>>>>> LUCENE-2723.patch (there are many patches under this name -- do I
>>>>>> need the latest one?) and LUCENE-2723_termscorer.patch, and apply
>>>>>> them (patch -p1 < LUCENE-2723.patch)? I just checked out the latest
>>>>>> source code from http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene
>>>>>>
>>>>>> 2010/12/14 Michael McCandless <luc...@mikemccandless.com>:
>>>>>>> Likely you are seeing the startup cost of hotspot compiling the
>>>>>>> PFOR code?
>>>>>>>
>>>>>>> Ie, does your test first "warm up" the JRE and then do the real test?
>>>>>>>
>>>>>>> I've also found that running -Xbatch produces more consistent
>>>>>>> results from run to run; however, those results may not be as fast
>>>>>>> as running w/o -Xbatch.
>>>>>>>
>>>>>>> Also, it's better to test on actual data (ie a Lucene index's
>>>>>>> postings), and in the full context of searching, because then we
>>>>>>> get a sense of what speedups a real app will see... micro-benchmarking
>>>>>>> is nearly impossible in Java since Hotspot acts very differently vs
>>>>>>> the "real" test.
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> On Tue, Dec 14, 2010 at 2:50 AM, Li Li <fancye...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>> I tried to integrate PForDelta into Lucene 2.9 but ran into a
>>>>>>>> problem.
>>>>>>>> I use the implementation in
>>>>>>>> http://code.google.com/p/integer-array-compress-kit/
>>>>>>>> It implements a basic PForDelta algorithm and an improved one
>>>>>>>> (called NewPForDelta; it had many bugs, which I have fixed).
>>>>>>>> But compared with vInt and S9, its speed is very slow when it
>>>>>>>> decodes only a small number of integer arrays.
>>>>>>>> E.g. I decode int[256] arrays whose values are randomly generated
>>>>>>>> between 0 and 100: when decoding just one array, PFor (or NewPFor)
>>>>>>>> is very slow; when it continuously decodes many arrays, such as
>>>>>>>> 10000, it's faster than S9 and vInt.
>>>>>>>> Another strange phenomenon: when I call the PFor decoder twice,
>>>>>>>> it's faster the second time. Or if I call PFor first and then
>>>>>>>> NewPFor, NewPFor is faster; reverse the call sequence and again the
>>>>>>>> decoder called second is faster.
>>>>>>>> E.g., when I call:
>>>>>>>>
>>>>>>>>   ct.testNewPFDCodes(list);
>>>>>>>>   ct.testPFor(list);
>>>>>>>>   ct.testVInt(list);
>>>>>>>>   ct.testS9(list);
>>>>>>>>
>>>>>>>>   NewPFD decode: 3614705
>>>>>>>>   PForDelta decode: 17320
>>>>>>>>   VINT decode: 16483
>>>>>>>>   S9 decode: 19835
>>>>>>>>
>>>>>>>> and when I call in the following sequence:
>>>>>>>>
>>>>>>>>   ct.testPFor(list);
>>>>>>>>   ct.testNewPFDCodes(list);
>>>>>>>>   ct.testVInt(list);
>>>>>>>>   ct.testS9(list);
>>>>>>>>
>>>>>>>>   PForDelta decode: 3212140
>>>>>>>>   NewPFD decode: 19556
>>>>>>>>   VINT decode: 16762
>>>>>>>>   S9 decode: 16483
>>>>>>>>
>>>>>>>> My implementation: group docIDs and termDocFreqs into blocks of
>>>>>>>> 128 integers each. When SegmentTermDocs's next method is called
>>>>>>>> (or read/readNoTf), it decodes a block and saves it to a cache.
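[Editor's note] The "whichever decoder runs first is slow" pattern in the numbers above is the JIT warmup cost Mike describes: the first hot loop pays for interpretation and compilation, and later calls run compiled code. A minimal harness sketch (the summing loop is a stand-in for a real PFor decode call; names are illustrative):

```java
public class WarmupBench {
    /** Stand-in hot loop; a real benchmark would call the PFor decoder here. */
    static long decodeAll(int[] data, int rounds) {
        long sum = 0;
        for (int r = 0; r < rounds; r++) {
            for (int v : data) {
                sum += v;
            }
        }
        return sum;
    }

    /** Time one measured pass, optionally preceded by an unmeasured warmup
     *  pass that gives the JIT a chance to compile the hot loop first. */
    static long timeNanos(int[] data, int rounds, boolean warmup) {
        if (warmup) {
            decodeAll(data, rounds); // not timed: triggers JIT compilation
        }
        long t0 = System.nanoTime();
        long sink = decodeAll(data, rounds);
        long elapsed = System.nanoTime() - t0;
        // consume the result so the JIT cannot eliminate the loop entirely
        return (sink == Long.MIN_VALUE) ? -1 : elapsed;
    }
}
```

Comparing `timeNanos(..., false)` and `timeNanos(..., true)` in a fresh JVM illustrates why the call order shuffles the rankings: only measurements taken after warmup reflect steady-state decode speed.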
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org