I did another test using Lucene 4 trunk with the default codecs. Its index
file is the same as Lucene 2.9's, and the speed is almost the same as
Lucene 2.9.
>> I think it could be the fact that AND query does block reads (64
>> doc/freqs at once) instead of doc-at-once? Ie, because of this, the
>> query is effectively scanning the next block of 64 docs instead of
>> skipping to them? Our skipping impl is unfortunately rather costly so
>> if skip will not skip that many docs it's better to scan.

I agree with this explanation. For high-frequency terms the skip list
cannot skip over many docs. It seems there is something here that could
be optimized, e.g. for high-frequency terms we scan, and for
low-frequency terms we use the skip list. But if we only care about the
"bad case", we only need to care about high-frequency terms. (A rough
sketch of this heuristic is appended at the end of this mail.)

>> You only use PFor for the very high freq terms in 2.9.x right?

I use PFor if the df is greater than 128; otherwise I use VINT.

>> until we fix Lucene to run a single search concurrently (which we
>> badly need to do).

I am interested in this idea (I have posted about it before). Do you
have any resources such as papers or tech articles about it? I tried it
before, but it required dramatic modifications to the index format, and
we were using Solr distributed search to relieve the response-time
problem, so I finally gave it up. Lucene 4's index format is more
flexible in that it supports custom codecs, and it is still under
development, so I think this is a good time to consider letting it
support multithreaded searching for a single query.

I have a naive solution: divide the docList into several groups, e.g.
group the docIds by whether they are even or odd:

  term1 df1=4 docList = 0 4 8 10
  term1 df2=4 docList = 1 3 9 11
  term2 df1=4 docList = 0 6 8 12
  term2 df2=4 docList = 3 9 11 15

Then we can use 2 threads to search the topN docs on the even group and
the odd group, and finally merge their results into a single result,
just like Solr distributed search (a toy sketch of this is also
appended at the end of this mail). But it's better than Solr
distributed search. First, it runs in a single process, and data
communication between threads is much faster than over the network.
Second, each thread processes the same number of documents. With Solr
distributed search, one shard may process 7 documents while another
processes only 1. Even if we make each shard hold the same number of
documents, we cannot make the distribution uniform for each term. E.g.:

  shard1 has doc1 doc2
  shard2 has doc3 doc4

but term1 may occur only in doc1 and doc2 while term2 may occur only in
doc3 and doc4. We may modify it to

  shard1 has doc1 doc3
  shard2 has doc2 doc4

which is good for term1 and term2, but term3 may occur only in doc1 and
doc3... So I think this is fine-grained distribution inside the index,
while Solr distributed search is coarse-grained.

2010/12/30 Michael McCandless <luc...@mikemccandless.com>:
> On Mon, Dec 27, 2010 at 5:08 AM, Li Li <fancye...@gmail.com> wrote:
>> I integrated the pfor codec into lucene 2.9.3 and the search time
>> comparison is as follows:
>>
>>                            single term   and query   or query
>> VINT in lucene 2.9.3          11.2          36.5       38.6
>> PFor in lucene 2.9.3           8.7          27.6       33.4
>> VINT in lucene 4 branch       10.6          26.5       35.4
>> PFor in lucene 4 branch        8.1          22.5       30.7
>>
>> My test terms are high-frequency terms because we are interested in
>> the "bad case"
>
> I agree it's the bad cases we should focus on in general. If a super
> fast query gets somewhat slower it's "relatively harmless" (just a
> "capacity" question for high volume sites) but if the bad queries get
> slower it's awful (requires faster cutover to a sharded architecture),
> until we fix Lucene to run a single search concurrently (which we
> badly need to do).
>
>> It seems the lucene 4 branch's implementation of the and query
>> (conjunction query) is so well optimized that even with the VINT
>> codec it's faster than PFor in lucene 2.9.3.
>> Could anyone tell me what optimization is done? Is storing docIDs
>> and freqs separately making it faster, or anything else?
>
> Actually vInt on the bulkpostings branch stores freq/doc together. Ie
> the format is the same as 2.9.x's format. I think it could be the
> fact that AND query does block reads (64 doc/freqs at once) instead of
> doc-at-once? Ie, because of this, the query is effectively scanning
> the next block of 64 docs instead of skipping to them? Our skipping
> impl is unfortunately rather costly so if skip will not skip that many
> docs it's better to scan.
>
>> Another question: is anyone else interested in integrating the pfor
>> codec into lucene 2.9.3 as I did (we have to use lucene 2.9 and solr
>> 1.4)? And how do I contribute this patch?
>
> Realistically I don't think we can commit this to 2.9.x -- that branch
> is purely bug fixes at this point.
>
> Still it's possible others could make use of such a patch so if it's
> not too much work you may as well post it? It can lead to
> improvements on the bulk postings branch too :) The more patches the
> merrier!
>
> You only use PFor for the very high freq terms in 2.9.x right? I've
> wondered if we should do the same on bulkpostings... problem is for eg
> range queries, that visit all docs for all terms b/w X and Y, you want
> the bulk decode even for low freq terms...
>
> Mike
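P.S. To make the scan-versus-skip heuristic above concrete, here is a
rough sketch in plain Java. It is not Lucene's actual postings API:
PostingsReader, SkipReader, skipTo() and SCAN_THRESHOLD are made-up
names, and the 64-doc threshold is just an assumption borrowed from the
block size mentioned above. advance() only consults the (relatively
costly) skip list when the target is far ahead; otherwise it keeps
scanning the already-decoded docs.

// Illustrative only -- not Lucene code. Exhaustion handling is kept minimal.
class PostingsReader {
  private final int[] docs;       // decoded doc IDs of one term (toy in-memory postings)
  private int pos = -1;           // current position in the postings
  private final SkipReader skip;  // hypothetical skip-list reader
  private static final int SCAN_THRESHOLD = 64;  // roughly one decoded block

  PostingsReader(int[] docs, SkipReader skip) {
    this.docs = docs;
    this.skip = skip;
  }

  int nextDoc() {
    return ++pos < docs.length ? docs[pos] : Integer.MAX_VALUE;  // MAX_VALUE = exhausted
  }

  /** Advance to the first doc >= target. */
  int advance(int target) {
    int current = (pos < 0 || pos >= docs.length) ? -1 : docs[pos];
    // Only use the skip list when the target is far ahead; for high-frequency
    // terms in an AND query the gap is usually small, so scanning is cheaper.
    if (target - current > SCAN_THRESHOLD) {
      pos = skip.skipTo(target);
    }
    int doc;
    while ((doc = nextDoc()) < target) {
      // scan the remaining few docs
    }
    return doc;
  }
}

interface SkipReader {
  /** Returns a position in the postings whose doc ID is still < target. */
  int skipTo(int target);
}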
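And here is a toy, self-contained sketch of the even/odd grouping idea.
Again, this is not real Lucene code: ParitySearchSketch, Hit,
searchSlice() and the scores are all invented for illustration. Two
threads each compute a topN over their parity group, and the partial
results are merged into a single topN, just like merging shard
responses in Solr distributed search, but inside one process.

import java.util.*;
import java.util.concurrent.*;

public class ParitySearchSketch {

  static class Hit {
    final int doc; final float score;
    Hit(int doc, float score) { this.doc = doc; this.score = score; }
    public String toString() { return "doc=" + doc + " score=" + score; }
  }

  // Hypothetical per-group "search": scores the docs of one parity group
  // and keeps only the best topN in a min-heap.
  static List<Hit> searchSlice(int[] docs, float[] scores, int topN) {
    PriorityQueue<Hit> pq = new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
    for (int i = 0; i < docs.length; i++) {
      pq.offer(new Hit(docs[i], scores[i]));
      if (pq.size() > topN) pq.poll();        // drop the current worst hit
    }
    List<Hit> hits = new ArrayList<>(pq);
    hits.sort((a, b) -> Float.compare(b.score, a.score));
    return hits;
  }

  public static void main(String[] args) throws Exception {
    // The even and odd doc-ID groups of one (made-up) term, as in the example
    // above; the scores are invented just to have something to merge.
    int[] evenDocs = {0, 4, 8, 10};  float[] evenScores = {1.2f, 0.4f, 2.0f, 0.9f};
    int[] oddDocs  = {1, 3, 9, 11};  float[] oddScores  = {0.7f, 1.5f, 0.3f, 1.1f};
    int topN = 3;

    ExecutorService pool = Executors.newFixedThreadPool(2);
    Future<List<Hit>> even = pool.submit(() -> searchSlice(evenDocs, evenScores, topN));
    Future<List<Hit>> odd  = pool.submit(() -> searchSlice(oddDocs, oddScores, topN));

    // Merge the two partial topN lists into a single topN, like merging shard
    // responses, but the "shards" live in the same process.
    List<Hit> merged = new ArrayList<>();
    merged.addAll(even.get());
    merged.addAll(odd.get());
    merged.sort((a, b) -> Float.compare(b.score, a.score));
    System.out.println(merged.subList(0, Math.min(topN, merged.size())));
    pool.shutdown();
  }
}

In a real implementation the two groups would share the same scoring
logic and the merge would have to handle ties and paging, but the shape
of the data flow is the same.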