I think random testing is not sufficient: in normal situations some branches are never executed. I tested http://code.google.com/p/integer-array-compress-kit/ with many random int arrays and it worked, but when I used it in real indexing it produced corruption during the optimize stage. PForDelta chooses the best numFrameBits, and some bit widths, such as 31, are rarely produced by random arrays. So I "forced" the encoder to choose every possible numFrameBits, to exercise all of decode1 ... decode32, and found several bugs that way.

What is pfor2? Does it use s9/s16 to encode the exceptions and offsets? In http://code.google.com/p/integer-array-compress-kit/ it's s9. NewPForDelta also had many bugs, and it likewise needs every branch tested to ensure it works well.
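[Editor's note] The branch-coverage trick described above — forcing each frame bit width so every decodeN path runs — can be sketched as follows. This is not the kit's actual API; `blockForBits` is an illustrative helper that generates a block guaranteed to require exactly b frame bits:

```java
import java.util.Random;

public class BitWidthArrays {
    /** Build a block of ints that needs a frame width of exactly b bits (1 <= b <= 32). */
    static int[] blockForBits(int b, int size, long seed) {
        Random r = new Random(seed);
        long mask = (b == 32) ? 0xFFFFFFFFL : ((1L << b) - 1);
        int[] block = new int[size];
        for (int i = 0; i < size; i++) {
            block[i] = (int) (r.nextLong() & mask); // every value fits in b bits
        }
        // plant one value whose top bit is bit b-1, so exactly b bits are required
        block[r.nextInt(size)] = (int) (1L << (b - 1));
        return block;
    }
}
```

Feeding each of these 32 blocks through encode/decode round-trips would touch every decodeN branch, which purely random data almost never does.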
2010/12/20 Michael McCandless <luc...@mikemccandless.com>:
> It's autogen'd from the Python script in that dir (gendecompress),
> but we are actively experimenting with the numerous ways to feed it
> data from the file, to see what the JVM can most efficiently execute.
>
> For testing, we need better coverage here. But I have an initial
> "encode random ints" test that I'm about to commit to the bulkpostings
> branch... the pfor1 impl passes it, but pfor2 doesn't yet (I think
> maybe because Simple16 can't handle ints >= 2^28?).
>
> Mike
>
> On Sun, Dec 19, 2010 at 10:06 PM, Li Li <fancye...@gmail.com> wrote:
>> Is ForDecompressImpl generated by code or manually written?
>> I am frustrated by
>> http://code.google.com/p/integer-array-compress-kit/, which contains
>> too many bugs (I fixed more than 20, but bugs still remain),
>> because the decoder has too many branches, and in normal situations
>> some branches seldom occur.
>>
>> The decoders implemented on the branch assume blockSize == 128, so they
>> have fewer branches than a decoder supporting an arbitrary blockSize.
>> How do you test these decoders to ensure every branch is exercised?
>>
>> 2010/12/16 Michael McCandless <luc...@mikemccandless.com>:
>>> On the bulkpostings branch you can do something like this:
>>>
>>>   CodecProvider cp = new CodecProvider();
>>>   cp.register(new PatchedFrameOfRefCodec());
>>>   cp.setDefaultFieldCodec("PatchedFrameOfRef");
>>>
>>> Then whenever you create an IW or IR, use the advanced method that
>>> accepts a CodecProvider. Then the index will always use PForDelta to
>>> write/read.
>>>
>>> I suspect conjunction queries got faster because we no longer skip if
>>> the docID we seek is already in the current buffer (currently sized
>>> 64). Ie, skip is very costly when the target isn't far.
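[Editor's note] Mike's guess about pfor2 failing on large ints is plausible if pfor2 uses Simple16 for exceptions: a Simple16 word spends 4 of its 32 bits on the selector, leaving at most 28 data bits per word, so any single value needing 29 or more bits cannot be represented. A quick predicate (name and framing are mine, not the codec's API) capturing that limit:

```java
public class Simple16Limit {
    /** Simple16 packs a 4-bit selector plus at most 28 data bits into each
     *  32-bit word, so no single value wider than 28 bits is encodable. */
    static boolean encodable(int v) {
        return v >= 0 && v < (1 << 28);
    }
}
```

A random-int test that draws values across the full 32-bit range would trip this immediately, which matches the reported failure.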
This was
>>> sort of an accidental byproduct of forcing even conjunction queries
>>> using the Standard (vInt) codec to work on block buffers, but I think
>>> it's an important optimization that we should apply more generally.
>>>
>>> Skipping for block codecs and Standard/vInt is done with the same class
>>> now. It's just that a block codec must store the long filePointer
>>> where the block starts *and* the int offset into the block, whereas the
>>> Standard codec stores just a filePointer.
>>>
>>> On "how do we implement bulk read": this is the core change on the
>>> bulkpostings branch -- we have a new API to separately bulk-read
>>> docDeltas, freqs, and positionDeltas. But we are rapidly iterating on
>>> improving this (and getting to a clean PFor/For impl) now...
>>>
>>> Mike
>>>
>>> On Thu, Dec 16, 2010 at 4:29 AM, Li Li <fancye...@gmail.com> wrote:
>>>> hi Michael,
>>>> Lucene 4 has so many changes that I don't know how to index and
>>>> search with a specified codec. Could you please give me some code
>>>> snippets using the PFor codec so I can trace the code?
>>>> In your blog
>>>> http://chbits.blogspot.com/2010/08/lucene-performance-with-pfordelta-codec.html
>>>> you said "The AND query, curiously, got faster; I think this is
>>>> because I modified its scoring to first try seeking within the block
>>>> of decoded ints".
>>>> I am also curious about that result, because vInt only needs to decode
>>>> part of the doc list while PFor must decode the whole block. But I
>>>> think that with conjunction queries most of the time goes to searching
>>>> the skip list. I haven't read your code yet, but I guess the skip
>>>> lists for vInt and for PFor are different.
>>>> E.g. Lucene 2.9's default skipInterval is 16, so it looks like:
>>>>   level 1: 256
>>>>   level 0: 16 32 48 64 80 96 112 128 ... 256
>>>> For skipTo(60) we need to read 0 16 32 48 64 in level 0,
>>>> but when using blocks, e.g.
a block size of 128, my implementation of the skip list is:
>>>>   level 1: 256
>>>>   level 0: 128 256
>>>> For skipTo(60) we read only 2 items in level 0 and decode the first
>>>> block, which contains 128 docIDs.
>>>>
>>>> How do you implement bulk read?
>>>> I did it like this: I decode a block and cache it in an int array. I
>>>> think I can buffer up to 100K docIDs and tfs for disjunction queries
>>>> (it costs less than 1MB of memory per term).
>>>>
>>>> SegmentTermDocs.read(final int[] docs, final int[] freqs)
>>>>   ...
>>>>   while (i < length && count < df) {
>>>>     // often false; the JVM hotspot will cache "hot" code, so ...
>>>>     if (curBlockIdx >= curBlockSize) {
>>>>       int idBlockBytes = freqStream.readVInt();
>>>>       curBlockIdx = 0;
>>>>       for (int k = 0; k < idBlockBytes; k++) {
>>>>         buffer[k] = freqStream.readInt();
>>>>       }
>>>>       blockIds = code.decode(buffer, idBlockBytes);
>>>>       curBlockSize = blockIds.length;
>>>>
>>>>       int tfBlockBytes = freqStream.readVInt();
>>>>       for (int k = 0; k < tfBlockBytes; k++) {
>>>>         buffer[k] = freqStream.readInt();
>>>>       }
>>>>       blockTfs = code.decode(buffer, tfBlockBytes);
>>>>       assert curBlockSize == blockTfs.length;
>>>>     }
>>>>     freq = blockTfs[curBlockIdx];
>>>>     doc += blockIds[curBlockIdx++];
>>>>     count++;
>>>>
>>>>     if (deletedDocs == null || !deletedDocs.get(doc)) {
>>>>       docs[i] = doc;
>>>>       freqs[i] = freq;
>>>>       ++i;
>>>>     }
>>>>   }
>>>>
>>>> 2010/12/15 Michael McCandless <luc...@mikemccandless.com>:
>>>>> Hi Li Li,
>>>>>
>>>>> That issue has such a big patch, and enough of us are now iterating on
>>>>> it, that we cut a dedicated branch for it.
>>>>>
>>>>> But note that this branch is off of trunk (to be 4.0).
>>>>>
>>>>> You should be able to do this:
>>>>>
>>>>> svn checkout
>>>>> https://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings
>>>>>
>>>>> And then run things in there.
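[Editor's note] The skip-list comparison above (skipInterval 16 vs. one entry per 128-doc block) can be made concrete with a toy model. The uniform-interval layout and the name `entriesScanned` are simplifications of mine, not Lucene's actual skip-list code:

```java
public class SkipSketch {
    /** Toy level-0 scan: entries sit at interval, 2*interval, 3*interval, ...
     *  Count how many entries are read before one passes the target docID. */
    static int entriesScanned(int target, int interval) {
        int scanned = 0;
        for (int entry = interval; ; entry += interval) {
            scanned++;
            if (entry > target) {
                return scanned;
            }
        }
    }
}
```

With interval 16, skipTo(60) touches 4 entries; with a 128-doc block, a single entry suffices, though the whole 128-int block must then be decoded. That trade-off is exactly the one debated in the thread.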
I just committed FOR/PFOR prototype
>>>>> codecs from LUCENE-1410 onto that branch, so e.g. you can run unit
>>>>> tests using those codecs by running "ant test
>>>>> -Dtests.codec=PatchedFrameOfRef".
>>>>>
>>>>> Please post patches back if you improve things! We need all the help
>>>>> we can get :)
>>>>>
>>>>> Mike
>>>>>
>>>>> On Wed, Dec 15, 2010 at 5:54 AM, Li Li <fancye...@gmail.com> wrote:
>>>>>> hi Michael,
>>>>>> You posted a patch here:
>>>>>> https://issues.apache.org/jira/browse/LUCENE-2723
>>>>>> I am not familiar with patches. Do I need to download
>>>>>> LUCENE-2723.patch (there are many patches under this name -- do I
>>>>>> need the latest one?) and LUCENE-2723_termscorer.patch, and apply
>>>>>> them (patch -p1 < LUCENE-2723.patch)? I just checked out the latest
>>>>>> source code from http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene
>>>>>>
>>>>>> 2010/12/14 Michael McCandless <luc...@mikemccandless.com>:
>>>>>>> Likely you are seeing the startup cost of hotspot compiling the
>>>>>>> PFOR code?
>>>>>>>
>>>>>>> Ie, does your test first "warm up" the JRE and then do the real test?
>>>>>>>
>>>>>>> I've also found that running -Xbatch produces more consistent
>>>>>>> results from run to run; however, those results may not be as fast
>>>>>>> as running w/o -Xbatch.
>>>>>>>
>>>>>>> Also, it's better to test on actual data (ie a Lucene index's
>>>>>>> postings), and in the full context of searching, because then we
>>>>>>> get a sense of what speedups a real app will see... micro-benchmarking
>>>>>>> is nearly impossible in Java since Hotspot acts very differently vs
>>>>>>> the "real" test.
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>> On Tue, Dec 14, 2010 at 2:50 AM, Li Li <fancye...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>> I tried to integrate PForDelta into Lucene 2.9 but ran into a
>>>>>>>> problem.
>>>>>>>> I use the implementation in
>>>>>>>> http://code.google.com/p/integer-array-compress-kit/
>>>>>>>> It implements a basic PForDelta algorithm and an improved one
>>>>>>>> (called NewPForDelta; it had many bugs, which I have fixed).
>>>>>>>> But compared with vInt and S9, its speed is very slow when it
>>>>>>>> decodes only a small number of integer arrays.
>>>>>>>> E.g. I decode int[256] arrays whose values are randomly generated
>>>>>>>> between 0 and 100: when decoding just one array, PFor (or NewPFor)
>>>>>>>> is very slow; when it continuously decodes many arrays, such as
>>>>>>>> 10000, it's faster than S9 and vInt.
>>>>>>>> Another strange phenomenon: when I call the PFor decoder twice,
>>>>>>>> it's faster the second time. Or if I call PFor first and then
>>>>>>>> NewPFor, NewPFor is faster; reverse the call sequence and again the
>>>>>>>> decoder called second is faster.
>>>>>>>> E.g., when I call:
>>>>>>>>
>>>>>>>>   ct.testNewPFDCodes(list);
>>>>>>>>   ct.testPFor(list);
>>>>>>>>   ct.testVInt(list);
>>>>>>>>   ct.testS9(list);
>>>>>>>>
>>>>>>>>   NewPFD decode: 3614705
>>>>>>>>   PForDelta decode: 17320
>>>>>>>>   VINT decode: 16483
>>>>>>>>   S9 decode: 19835
>>>>>>>>
>>>>>>>> and when I call in the following sequence:
>>>>>>>>
>>>>>>>>   ct.testPFor(list);
>>>>>>>>   ct.testNewPFDCodes(list);
>>>>>>>>   ct.testVInt(list);
>>>>>>>>   ct.testS9(list);
>>>>>>>>
>>>>>>>>   PForDelta decode: 3212140
>>>>>>>>   NewPFD decode: 19556
>>>>>>>>   VINT decode: 16762
>>>>>>>>   S9 decode: 16483
>>>>>>>>
>>>>>>>> My implementation: group docIDs and termDocFreqs into blocks of
>>>>>>>> 128 integers each. When SegmentTermDocs's next method is called
>>>>>>>> (or read/readNoTf), it decodes a block and saves it to a cache.
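[Editor's note] The "whichever decoder runs first is slow" pattern in the numbers above is the JIT warmup cost Mike describes: the first hot loop pays for interpretation and compilation, and later calls run compiled code. A minimal harness sketch (the summing loop is a stand-in for a real PFor decode call; names are illustrative):

```java
public class WarmupBench {
    /** Stand-in hot loop; a real benchmark would call the PFor decoder here. */
    static long decodeAll(int[] data, int rounds) {
        long sum = 0;
        for (int r = 0; r < rounds; r++) {
            for (int v : data) {
                sum += v;
            }
        }
        return sum;
    }

    /** Time one measured pass, optionally preceded by an unmeasured warmup
     *  pass that gives the JIT a chance to compile the hot loop first. */
    static long timeNanos(int[] data, int rounds, boolean warmup) {
        if (warmup) {
            decodeAll(data, rounds); // not timed: triggers JIT compilation
        }
        long t0 = System.nanoTime();
        long sink = decodeAll(data, rounds);
        long elapsed = System.nanoTime() - t0;
        // consume the result so the JIT cannot eliminate the loop entirely
        return (sink == Long.MIN_VALUE) ? -1 : elapsed;
    }
}
```

Comparing `timeNanos(..., false)` and `timeNanos(..., true)` in a fresh JVM illustrates why the call order shuffles the rankings: only measurements taken after warmup reflect steady-state decode speed.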
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org