I did another test using Lucene 4 trunk with the default codecs. Its index
file is the same as Lucene 2.9's, and the speed is almost the same as
Lucene 2.9.
>> I think it could be the fact that AND query does block reads (64
>> doc/freqs at once) instead of doc-at-once? Ie, because of this, the
>> query is effectively scanning the next block of 64 docs instead of
>> skipping to them? Our skipping impl is unfortunately rather costly so
>> if skip will not skip that many docs it's better to scan.

I agree with this explanation. For high-frequency terms the skip list
cannot skip over many docs. It seems there is something here that could
be optimized, e.g. for high-frequency terms we scan, and for
low-frequency terms we use the skip list. But if we only care about the
"bad case", we only need to care about high-frequency terms. (A rough
sketch of this heuristic is appended at the end of this mail.)

>> You only use PFor for the very high freq terms in 2.9.x right?

I use PFor if the df is greater than 128; otherwise I use VINT.

>> until we fix Lucene to run a single search concurrently (which we
>> badly need to do).

I am interested in this idea (I have posted about it before). Do you
have any resources such as papers or tech articles about it? I tried it
before, but it required dramatic modifications to the index format, and
we were using Solr distributed search to relieve the response-time
problem, so I finally gave it up. Lucene 4's index format is more
flexible in that it supports custom codecs, and it is still under
development, so I think this is a good time to consider letting it
support multithreaded searching for a single query.

I have a naive solution: divide the docList into several groups, e.g.
group the docIds by whether they are even or odd:

  term1 df1=4 docList = 0 4 8 10
  term1 df2=4 docList = 1 3 9 11
  term2 df1=4 docList = 0 6 8 12
  term2 df2=4 docList = 3 9 11 15

Then we can use 2 threads to search the topN docs on the even group and
the odd group, and finally merge their results into a single result,
just like Solr distributed search (a toy sketch of this is also
appended at the end of this mail). But it's better than Solr
distributed search. First, it runs in a single process, and data
communication between threads is much faster than over the network.
Second, each thread processes the same number of documents. With Solr
distributed search, one shard may process 7 documents while another
processes only 1. Even if we make each shard hold the same number of
documents, we cannot make the distribution uniform for each term. E.g.:

  shard1 has doc1 doc2
  shard2 has doc3 doc4

but term1 may occur only in doc1 and doc2 while term2 may occur only in
doc3 and doc4. We may modify it to

  shard1 has doc1 doc3
  shard2 has doc2 doc4

which is good for term1 and term2, but term3 may occur only in doc1 and
doc3... So I think this is fine-grained distribution inside the index,
while Solr distributed search is coarse-grained.

2010/12/30 Michael McCandless <luc...@mikemccandless.com>:
> On Mon, Dec 27, 2010 at 5:08 AM, Li Li <fancye...@gmail.com> wrote:
>> I integrated the pfor codec into lucene 2.9.3 and the search time
>> comparison is as follows:
>>
>>                            single term   and query   or query
>> VINT in lucene 2.9.3          11.2          36.5       38.6
>> PFor in lucene 2.9.3           8.7          27.6       33.4
>> VINT in lucene 4 branch       10.6          26.5       35.4
>> PFor in lucene 4 branch        8.1          22.5       30.7
>>
>> My test terms are high-frequency terms because we are interested in
>> the "bad case"
>
> I agree it's the bad cases we should focus on in general. If a super
> fast query gets somewhat slower it's "relatively harmless" (just a
> "capacity" question for high volume sites) but if the bad queries get
> slower it's awful (requires faster cutover to a sharded architecture),
> until we fix Lucene to run a single search concurrently (which we
> badly need to do).
>
>> It seems the lucene 4 branch's implementation of the and query
>> (conjunction query) is so well optimized that even with the VINT
>> codec it's faster than PFor in lucene 2.9.3.
>> Could anyone tell me what optimization is done? Is storing docIDs
>> and freqs separately making it faster, or anything else?
>
> Actually vInt on the bulkpostings branch stores freq/doc together. Ie
> the format is the same as 2.9.x's format. I think it could be the
> fact that AND query does block reads (64 doc/freqs at once) instead of
> doc-at-once? Ie, because of this, the query is effectively scanning
> the next block of 64 docs instead of skipping to them? Our skipping
> impl is unfortunately rather costly so if skip will not skip that many
> docs it's better to scan.
>
>> Another question: is anyone else interested in integrating the pfor
>> codec into lucene 2.9.3 as I did (we have to use lucene 2.9 and solr
>> 1.4)? And how do I contribute this patch?
>
> Realistically I don't think we can commit this to 2.9.x -- that branch
> is purely bug fixes at this point.
>
> Still it's possible others could make use of such a patch so if it's
> not too much work you may as well post it? It can lead to
> improvements on the bulk postings branch too :) The more patches the
> merrier!
>
> You only use PFor for the very high freq terms in 2.9.x right? I've
> wondered if we should do the same on bulkpostings... problem is for eg
> range queries, that visit all docs for all terms b/w X and Y, you want
> the bulk decode even for low freq terms...
>
> Mike
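P.S. To make the scan-versus-skip heuristic above concrete, here is a
rough sketch in plain Java. It is not Lucene's actual postings API:
PostingsReader, SkipReader, skipTo() and SCAN_THRESHOLD are made-up
names, and the 64-doc threshold is just an assumption borrowed from the
block size mentioned above. advance() only consults the (relatively
costly) skip list when the target is far ahead; otherwise it keeps
scanning the already-decoded docs.

// Illustrative only -- not Lucene code. Exhaustion handling is kept minimal.
class PostingsReader {
  private final int[] docs;       // decoded doc IDs of one term (toy in-memory postings)
  private int pos = -1;           // current position in the postings
  private final SkipReader skip;  // hypothetical skip-list reader
  private static final int SCAN_THRESHOLD = 64;  // roughly one decoded block

  PostingsReader(int[] docs, SkipReader skip) {
    this.docs = docs;
    this.skip = skip;
  }

  int nextDoc() {
    return ++pos < docs.length ? docs[pos] : Integer.MAX_VALUE;  // MAX_VALUE = exhausted
  }

  /** Advance to the first doc >= target. */
  int advance(int target) {
    int current = (pos < 0 || pos >= docs.length) ? -1 : docs[pos];
    // Only use the skip list when the target is far ahead; for high-frequency
    // terms in an AND query the gap is usually small, so scanning is cheaper.
    if (target - current > SCAN_THRESHOLD) {
      pos = skip.skipTo(target);
    }
    int doc;
    while ((doc = nextDoc()) < target) {
      // scan the remaining few docs
    }
    return doc;
  }
}

interface SkipReader {
  /** Returns a position in the postings whose doc ID is still < target. */
  int skipTo(int target);
}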
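And here is a toy, self-contained sketch of the even/odd grouping idea.
Again, this is not real Lucene code: ParitySearchSketch, Hit,
searchSlice() and the scores are all invented for illustration. Two
threads each compute a topN over their parity group, and the partial
results are merged into a single topN, just like merging shard
responses in Solr distributed search, but inside one process.

import java.util.*;
import java.util.concurrent.*;

public class ParitySearchSketch {

  static class Hit {
    final int doc; final float score;
    Hit(int doc, float score) { this.doc = doc; this.score = score; }
    public String toString() { return "doc=" + doc + " score=" + score; }
  }

  // Hypothetical per-group "search": scores the docs of one parity group
  // and keeps only the best topN in a min-heap.
  static List<Hit> searchSlice(int[] docs, float[] scores, int topN) {
    PriorityQueue<Hit> pq = new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
    for (int i = 0; i < docs.length; i++) {
      pq.offer(new Hit(docs[i], scores[i]));
      if (pq.size() > topN) pq.poll();        // drop the current worst hit
    }
    List<Hit> hits = new ArrayList<>(pq);
    hits.sort((a, b) -> Float.compare(b.score, a.score));
    return hits;
  }

  public static void main(String[] args) throws Exception {
    // The even and odd doc-ID groups of one (made-up) term, as in the example
    // above; the scores are invented just to have something to merge.
    int[] evenDocs = {0, 4, 8, 10};  float[] evenScores = {1.2f, 0.4f, 2.0f, 0.9f};
    int[] oddDocs  = {1, 3, 9, 11};  float[] oddScores  = {0.7f, 1.5f, 0.3f, 1.1f};
    int topN = 3;

    ExecutorService pool = Executors.newFixedThreadPool(2);
    Future<List<Hit>> even = pool.submit(() -> searchSlice(evenDocs, evenScores, topN));
    Future<List<Hit>> odd  = pool.submit(() -> searchSlice(oddDocs, oddScores, topN));

    // Merge the two partial topN lists into a single topN, like merging shard
    // responses, but the "shards" live in the same process.
    List<Hit> merged = new ArrayList<>();
    merged.addAll(even.get());
    merged.addAll(odd.get());
    merged.sort((a, b) -> Float.compare(b.score, a.score));
    System.out.println(merged.subList(0, Math.min(topN, merged.size())));
    pool.shutdown();
  }
}

In a real implementation the two groups would share the same scoring
logic and the merge would have to handle ties and paging, but the shape
of the data flow is the same.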